silencial: Nonlinear Control and Planning in Robotics (https://silencial.github.io/nonlinear-control/, 2021-05-13)

# System Models

## Constraints

The configuration space of a mechanical system is denoted by $$Q$$ and is assumed to be an $$n$$-dimensional manifold, locally isomorphic to $$\mathbb{R}^n$$. A configuration is denoted by $$q\in Q$$.

Holonomic (or geometric) constraints restrict possible motions to an $$(n-k)$$-dimensional sub-manifold $h_i(q) = 0, \quad i = 1,\dots,k$ Linear (Pfaffian) nonholonomic (or kinematic) constraints restrict the directions of velocities, but it is still possible to reach any configuration in $$Q$$: $A^T(q) \dot{q} = 0$ Nonholonomic constraints are not integrable, i.e. it is not possible to find $$k$$ functions $$h_i$$ s.t. $\nabla_{q} h_{i}(q)=a_{i}(q), \quad i=1, \ldots, k$ where $$a_i(q)$$ denotes the $$i$$-th column of $$A(q)$$
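
As a quick sketch of a Pfaffian constraint (using the standard unicycle model, an illustrative example): forward rolling satisfies $$A^T(q)\dot{q} = 0$$ while sideways sliding violates it.

```python
import numpy as np

# Unicycle: q = (x, y, theta). The no-slip constraint forbids sideways
# motion: sin(theta)*xdot - cos(theta)*ydot = 0, i.e. A^T(q) qdot = 0.
def A(q):
    _, _, th = q
    return np.array([[np.sin(th)], [-np.cos(th)], [0.0]])

q = np.array([1.0, 2.0, 0.3])
# Pure forward rolling at speed 1.5 with turn rate 0.7 is feasible
qdot = np.array([1.5 * np.cos(q[2]), 1.5 * np.sin(q[2]), 0.7])
print(A(q).T @ qdot)       # ≈ [0]: constraint satisfied

# Sideways sliding violates the constraint
qdot_side = np.array([-np.sin(q[2]), np.cos(q[2]), 0.0])
print(A(q).T @ qdot_side)  # [-1]: infeasible direction
```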

## Dynamics

### holonomic systems

Assume the system has a Lagrangian $L(q, \dot{q})=\frac{1}{2} \dot{q}^{T} M(q) \dot{q}-V(q)$ with inertial matrix $$M(q) \succ 0$$ and potential energy $$V(q)$$. The system is subject to external forces $$f_\text{ext}(q,\dot{q})\in \mathbb{R}^n$$ and control inputs $$u\in \mathbb{R}^m$$. The equations of motions are given by $\frac{d}{d t} \nabla_{\dot{q}} L-\nabla_{q} L=f_{\text{ext}}(q, \dot{q})+B(q) u$ where $$B(q)\in\mathbb{R}^{n\times m}$$ is a matrix mapping from $$m$$ control inputs to the forces/torques acting on the generalized coordinates $$q$$.

The actual equations take the form $M(q) \ddot{q}+b(q, \dot{q})=B(q) u$ where $\begin{equation}b(q, \dot{q})=\dot{M}(q) \dot{q}-\frac{1}{2} \nabla_{q}\left(\dot{q}^{T} M(q) \dot{q}\right)+\nabla_{q} V(q)-f_{\text{ext}}(q, \dot{q})\label{bq}\end{equation}$

### nonholonomic systems

Similar to holonomic systems but the Euler-Lagrange equations become $\frac{d}{d t} \nabla_{\dot{q}} L-\nabla_{q} L=A(q) \lambda+f_{\text{ext}}(q, \dot{q})+B(q) u$ where $$\lambda\in\mathbb{R}^k$$ is a vector of the Lagrange multipliers and the extra term $$A(q)\lambda$$ denotes the force that counters any motion in directions spanned by $$A(q)$$.

The actual equations take the form $M(q) \ddot{q}+b(q, \dot{q})=A(q)\lambda + B(q) u \\A^T(q)\dot{q} = 0$ where $$b(q,\dot{q})$$ has the same form as $$\eqref{bq}$$.

The Lagrange multipliers can be eliminated by choosing a matrix $$G(q)$$ whose columns span the feasible directions, i.e. $$A^T(q)G(q) = 0$$, and multiplying the dynamics by $$G^T(q)$$ to obtain a reduced set of $$m=n-k$$ differential equations: $G^{T}(q)(M(q) \ddot{q}+b(q, \dot{q}))=G^{T}(q) B(q) u$ A standard assumption is that $$S(q) = G^T(q)B(q)$$ has full rank, i.e. all feasible directions are actuated. Substituting $$\dot{q}=G(q)v$$, the final equations are $\dot{q}=G(q) v \\K(q) \dot{v}+n(q, v)=S(q) u$ where $K(q)=G^{T}(q) M(q) G(q)>0 \\n(q, v)=G^{T}(q) M(q) \dot{G}(q) v+G^{T}(q) b(q, G(q) v) \\S(q)=G^{T}(q) B(q)$

# Manifolds and Vector Fields

## Manifolds

A manifold is a set $$M$$ that locally "looks like" linear space, e.g. $$\mathbb{R}^n$$. A chart on $$M$$ is a subset $$U$$ of $$M$$ together with a bijective map $$\varphi:U\to \varphi(U) \subset \mathbb{R}^n$$. Two charts $$U, \varphi$$ and $$U', \varphi'$$ s.t. $$U\cap U'\ne \emptyset$$ are called compatible if $$\varphi(U \cap U')$$ and $$\varphi'(U\cap U')$$ are open subsets of $$\mathbb{R}^n$$ and the maps $\left.\varphi^{\prime} \circ \varphi^{-1}\right|_{\varphi\left(U \cap U^{\prime}\right)}: \varphi\left(U \cap U' \right) \rightarrow \varphi^{\prime}\left(U \cap U'\right) \\\left.\varphi \circ\left(\varphi^{\prime}\right)^{-1}\right|_{\varphi^{\prime}\left(U \cap U^{\prime}\right)}: \varphi^{\prime}\left(U \cap U' \right) \rightarrow \varphi\left(U \cap U'\right)$ are $$C^\infty$$. We call $$M$$ a differentiable $$n$$-manifold when:

1. The set $$M$$ is covered by a collection of charts, that is, every point is represented in at least one chart
2. $$M$$ has an atlas; that is, $$M$$ can be written as a union of compatible charts

## Tangents

Two curves $$t\to c_1(t)$$ and $$t\to c_2(t)$$ in an $$n$$-manifold $$M$$ are called equivalent at the point $$m$$ if $c_1(0) = c_2(0) = m$ and $\left.\frac{d}{d t}\left(\varphi \circ c_{1}\right)\right|_{t=0}=\left.\frac{d}{d t}\left(\varphi \circ c_{2}\right)\right|_{t=0}$ in some chart $$\varphi$$

A tangent vector $$v$$ to a manifold $$M$$ at point $$m$$ is an equivalence class of curves at $$m$$. The set of tangent vectors to $$M$$ at $$m$$ is a vector space, denoted by $$T_m M$$

The tangent bundle of $$M$$ denoted by $$TM$$ is the disjoint union of the tangent spaces to $$M$$ at the points $$m\in M$$, i.e. $T M=\bigcup_{m \in M} T_{m} M$

## Vector fields

A vector field $$X$$ on $$M$$ is a map $$X: M \to TM$$ that assigns a vector $$X(m)$$ at the point $$m\in M$$.

An integral curve of $$X$$ with initial condition $$m_0$$ at $$t=0$$ is a map $$c: (a,b) \to M$$ s.t. $$(a,b)$$ is an open interval containing $$0$$, $$c(0) = m_0$$ and $$c'(t) = X(c(t))$$ for all $$t\in (a,b)$$.

The flow of $$X$$ is a collection of maps $$\Phi_t: M\to M$$ s.t. $$t\to \Phi_t(m)$$ is the integral curve of $$X$$ with initial condition $$m$$.

## Lie bracket

Given two vector fields $$g_1(x)$$ and $$g_2(x)$$, do their flows commute? e.g. $$\Phi_{t}^{g_{2}} \circ \Phi_{t}^{g_{1}}=\Phi_{t}^{g_{1}} \circ \Phi_{t}^{g_{2}}$$

The difference is quantified by $\Phi_{t}^{-g_{2}} \circ \Phi_{t}^{-g_{1}} \circ \Phi_{t}^{g_{2}} \circ \Phi_{t}^{g_{1}}\left(x_{0}\right)=x_{0}+t^{2}\left[g_{1}, g_{2}\right]+O\left(t^{3}\right)$ where $$[g_1, g_2]$$ is the Lie bracket defined by $\left[g_{1}, g_{2}\right]=\frac{\partial g_{2}}{\partial x} g_{1}-\frac{\partial g_{1}}{\partial x} g_{2}$ or as applied to a function $$\alpha$$ by $\left[g_{1}, g_{2}\right] \alpha=g_{1}\left(g_{2} \alpha\right)-g_{2}\left(g_{1} \alpha\right)$
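
The commutator of flows can be checked numerically (unicycle-style vector fields chosen for illustration; both flows here happen to be exact):

```python
import numpy as np

# Unicycle fields: g1 = drive (cos th, sin th, 0), g2 = turn (0, 0, 1).
# Both flows are exact: theta is constant along g1, (x, y) along g2.
def flow_g1(q, t):
    x, y, th = q
    return np.array([x + t * np.cos(th), y + t * np.sin(th), th])

def flow_g2(q, t):
    x, y, th = q
    return np.array([x, y, th + t])

t, q0 = 1e-3, np.zeros(3)
# Phi^{-g2}_t o Phi^{-g1}_t o Phi^{g2}_t o Phi^{g1}_t (x0)
q = flow_g2(flow_g1(flow_g2(flow_g1(q0, t), t), -t), -t)
print((q - q0) / t**2)  # ≈ [g1, g2](q0) = (sin 0, -cos 0, 0) = (0, -1, 0)
```

The net displacement is in the sideways direction, which neither field can produce directly.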

A vector space $$V$$ with a bilinear operator $$[\cdot, \cdot]: V\times V\to V$$ is called a Lie algebra when satisfying:

1. Skew-symmetry: $$[v, w] = -[w, v]$$ for all $$v, w \in V$$
2. Jacobi identity: $$[[v, w], z]+[[z, v], w]+[[w, z], v]=0$$

# Distributions and Controllability

## Distributions

• Distributions determine possible directions of motion
• Controllability determines which states can be reached

Let $$g_1(x), \dots, g_m(x)$$ be linearly independent vector fields on $$M$$

A distribution $$\Delta$$ assigns a subspace of the tangent space to each point defined by $\Delta=\operatorname{span}\left\{g_{1}, \dots, g_{m}\right\}$ A distribution $$\Delta$$ is involutive if it is closed under the Lie bracket, i.e. $\forall f(x), g(x) \in \Delta(x), \quad[f(x), g(x)] \in \Delta(x)$ A distribution $$\Delta$$ is regular if the dimension of $$\Delta(x)$$ does not vary with $$x$$

A distribution $$\Delta$$ of constant dimension $$k$$ is integrable if $$\forall x\in\mathbb{R}^n$$ there are smooth functions $$h_i: \mathbb{R}^n \to \mathbb{R}$$ s.t. $$\frac{\partial h_i}{\partial x}$$ are linearly independent at $$x$$ and $$\forall f\in\Delta$$ we have $L_{f} h_{i}=\frac{\partial h_{i}}{\partial x} f(x)=0, \quad i=1, \ldots, n-k$ The integral manifolds of the distribution are defined as the level sets $\left\{x: h_{1}(x)=c_{1}, \ldots, h_{n-k}(x)=c_{n-k}\right\}$ Frobenius Theorem: A regular distribution is integrable iff it is involutive

## Controllability

### reachable sets

Consider the nonlinear control system (NCS) with $$x\in\mathbb{R}^n$$ and $$u\in U \subset \mathbb{R}^m$$: $\begin{equation}\quad \dot{x}=g_{0}(x)+\sum_{i=1}^{m} g_{i}(x) u_{i}\label{ncs}\end{equation}$ A system is controllable if $$\forall x_0, x_f \in \mathbb{R}^n$$ there exists a time $$T$$ and $$u:[0, T] \to U$$ s.t. $$\eqref{ncs}$$ satisfies $$x(0)=x_0$$ and $$x(T)=x_f$$

A system is small-time locally controllable (STLC) at $$x_0$$ if it can reach nearby points in arbitrarily small time while staying near $$x_0$$

The reachable set $$\mathcal{R}^V(x_0, T)$$ is the set of states $$x(T)$$ for which there is a control $$u:[0, T]\to U$$ that steers the system from $$x(0)$$ to $$x(T)$$ without leaving an open set $$V$$ around $$x_0$$

The set of states reachable up to time $$T$$ is defined by $\mathcal{R}^{V}\left(x_{0}, \leq T\right)=\bigcup_{0<\tau \leq T} \mathcal{R}^{V}\left(x_{0}, \tau\right)$

### controllability conditions

NCS is locally accessible (LA) from $$x_0$$ if for every neighborhood $$V$$ of $$x_0$$ and every $$T > 0$$ there exists an open set $$\Omega$$ s.t. $\Omega \subset \mathcal{R}^{V}\left(x_{0}, \leq T\right)$ NCS is STLC if for every neighborhood $$V$$ of $$x_0$$ and every $$T >0$$, $$\mathcal{R}^{V}(x_{0}, T)$$ contains a neighborhood of $$x_0$$

STLC $$\Rightarrow$$ controllable $$\Rightarrow$$ LA

NCS is LA from $$x_0$$ iff $$\dim\bar{\Delta}(x_0)=n$$ (LA rank condition (LARC)) where $\bar{\Delta}=\operatorname{span}\left\{v \in \bigcup_{k \geq 0} \Delta^{k}\right\} \text { with }\begin{cases}\Delta^{0}=\operatorname{span}\left\{g_{0}, g_{1}, \ldots, g_{m}\right\} \\\Delta^{k}=\Delta^{k-1}+\operatorname{span}\left\{\left[g_{j}, v\right], j=0, \ldots, m; v \in \Delta^{k-1}\right\}\end{cases}$ For driftless control systems ($$g_0 = 0$$), STLC $$\Leftrightarrow$$ controllable $$\Leftrightarrow$$ LA. The equivalence also holds when $$g_0(x) \in \operatorname{span}\{g_{1}, \ldots, g_{m}\}$$ (trivial drift)
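
A minimal LARC check for the unicycle distribution (an illustrative example; the bracket is entered analytically):

```python
import numpy as np

th = 0.3  # evaluate the distribution at an arbitrary configuration
g1 = np.array([np.cos(th), np.sin(th), 0.0])   # drive
g2 = np.array([0.0, 0.0, 1.0])                 # turn
# [g1, g2] = (dg2/dq) g1 - (dg1/dq) g2 = (sin th, -cos th, 0): slide
lie = np.array([np.sin(th), -np.cos(th), 0.0])

rank = np.linalg.matrix_rank(np.column_stack([g1, g2, lie]))
print(rank)  # 3 = n: LARC holds; the system is driftless, hence controllable
```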

If the driftless control system $\dot{q} = \sum_{i=1}^m g_i(q) v_i$ with state $$q$$ and inputs $$v$$ is controllable, then its dynamic extension \begin{aligned}&\dot{q}=\sum_{i=1}^{m} g_{i}(q) v_{i} \\&\dot{v}_{i}=u_{i}, \quad i=1, \ldots, m\end{aligned} with state $$x=(q,v)$$ and controls $$u$$ is also controllable (and vice versa)

The degree of nonholonomy is defined as the smallest $$k$$ s.t. $\dim \Delta^{k+1}=\dim \Delta^{k}$

• Completely nonholonomic: $$\dim \Delta^k = n$$
• Partially nonholonomic: $$m < \dim \Delta^k < n$$
• Holonomic: $$\dim \Delta^k = m = n-k$$

### good and bad brackets

A bad bracket is a Lie bracket generated using an odd number of $$g_0$$ and an even number of $$g_i$$ vector fields. A good bracket is one that is not bad.

NCS $$\eqref{ncs}$$ is STLC at $$x^*$$ if

1. $$g_0(x^*) = 0$$
2. $$U$$ is open and its convex hull contains $$0$$
3. LARC is satisfied using brackets of degree $$k$$
4. Any bad bracket of degree $$j \le k$$ can be expressed as linear combinations of good brackets of degree $$< j$$

# Stabilizability and Chained Forms

Given a NCS $\begin{equation}\dot{x} = f(x,u)\label{ncs_simple}\end{equation}$ the goal is to construct a control law $$u=K(x)$$ s.t.

• Stabilization: an equilibrium point $$x_e$$ is made asymptotically stable, or
• Tracking: a desired feasible trajectory $$x_d(t)$$ is asymptotically stable

The linear approximation of the system at $$x_e$$ is $\delta\dot{x}=A \delta x+B \delta u \quad (\delta x=x-x_{e},\ \delta u=u-u_{e}) \\A \triangleq \partial_x f(x_e, u_e), \qquad B \triangleq \partial_u f(x_e, u_e)$ The NCS can be locally smoothly stabilized at $$x_e$$ using $$\delta u = K\delta x$$ if the linearized system is controllable

Necessary conditions by Brockett's theorem: If $$\eqref{ncs_simple}$$ is locally asymptotically $$C^1$$-stabilizable ($$u$$ is $$C^1$$-smooth) at $$x_e = 0$$ then the image of the map $f: \mathbb{R}^{n} \times U \rightarrow \mathbb{R}^{n}$ contains some neighborhood of $$x_e$$

Nonholonomic mechanical systems cannot be stabilized at a point by smooth feedback, alternatives are

1. Time-varying feedback $$u = K(t,x)$$
2. Non-smooth feedback

# Trajectory generation

## Differential flatness

A system is differentially flat if one can find a set of outputs (equal to the number of inputs) which completely determine the whole state and the inputs without the need to integrate the system. More formally, a system with state $$x\in\mathbb{R}^n$$ and inputs $$u\in\mathbb{R}^m$$ is differentially flat if one can find outputs $$y\in\mathbb{R}^m$$ of the form $y = h(x, u, \dot{u}, \dots, u^{(a)})$ s.t. $x=\varphi(y, \dot{y}, \ldots, y^{(b)}) \\u=\alpha(y, \dot{y}, \ldots, y^{(c)})$ The coordinates $$y$$ are called flat outputs

## Trajectory generation

Consider the problem of generating a trajectory between two given states $\dot{x} = f(x,u), \qquad x(0) = x_0,\ x(T) = x_f$ The boundary conditions of a differentially flat systems are expressed as $\begin{array}{l}x(0)=\varphi\left(y(0), \dot{y}(0), \ldots, y^{(b)}(0)\right) \\x(T)=\varphi\left(y(T), \dot{y}(T), \ldots, y^{(b)}(T)\right)\end{array}$ The general strategy is to assume a parametric form $y(t) = A\lambda(t)$ where $$\lambda(t)\in\mathbb{R}^N$$ are basis functions and $$A\in \mathbb{R}^{m\times N}$$ is a constant matrix that satisfies the boundary conditions: $Y = A \Lambda$ where $$Y\in\mathbb{R}^{m\times 2(b+1)}$$ and $$\Lambda\in\mathbb{R}^{N\times 2(b+1)}$$ are $Y = \left[y(0), \dots, y^{(b)}(0), y(T), \dots, y^{(b)}(T)\right] \\\Lambda = \left[\lambda(0), \dots, \lambda^{(b)}(0), \lambda(T), \dots, \lambda^{(b)}(T)\right]$ It is necessary that $$N\ge 2(b+1)$$
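
A minimal sketch of this construction for a single flat output with boundary conditions on $$y$$ and $$\dot{y}$$ ($$b=1$$, monomial basis with $$N = 2(b+1) = 4$$; the numbers are made up):

```python
import numpy as np

T = 2.0
# Basis lambda(t) = (1, t, t^2, t^3); columns of Lam are
# lambda(0), lambdadot(0), lambda(T), lambdadot(T)
Lam = np.array([
    [1.0, 0.0, 1.0,    0.0],
    [0.0, 1.0, T,      1.0],
    [0.0, 0.0, T**2,   2*T],
    [0.0, 0.0, T**3, 3*T**2],
])
# Boundary data: y(0) = 0, ydot(0) = 0, y(T) = 1, ydot(T) = 0
Y = np.array([[0.0, 0.0, 1.0, 0.0]])
A = Y @ np.linalg.inv(Lam)          # solves Y = A Lam (N = 2(b+1), so square)

y = lambda t: A @ np.array([1.0, t, t**2, t**3])
print(y(0.0), y(T))                 # endpoints match the boundary conditions
```

With $$N > 2(b+1)$$ the extra degrees of freedom could instead be used to minimize a cost, e.g. a least-squares fit.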

# Feedback Linearization

## Input-output linearization

Consider the nonlinear control system $\dot{x}=f(x)+G(x) u \\y=h(x)$ Input-output linearization is to use transformation of $$u$$ so that the input-output response between $$v$$ and $$y$$ is linear.

Static feedback: $u=a(x) + B(x)v$ where $$B(x)$$ is a nonsingular matrix, and $$v$$ is called virtual input

Dynamic feedback: $u=a(x, \xi)+B(x, \xi) v \\\dot{\xi}=c(x, \xi)+D(x, \xi) v$ where $$\xi$$ is the compensator state

## Static feedback

### fully-actuated manipulator control

$\begin{equation}M(q) \ddot{q}+b(q, \dot{q})=u\label{fully-actuated}\end{equation}$

Assume that one is interested in tracking a desired path $$q_d(t)$$. The task is specified by the output $y=q$ Choose the virtual input $$v = \ddot{q}$$ and control law $v=\ddot{q}_{d}-k_{d}\left(\dot{q}-\dot{q}_{d}\right)-k_{p}\left(q-q_{d}\right)$ which results in closed-loop linear dynamics $\dot{z} = Az \\z=\begin{pmatrix}q-q_{d} \\\dot{q}-\dot{q}_{d}\end{pmatrix} \\A=\begin{pmatrix}0 & I \\-k_p I & -k_d I\end{pmatrix}$ where $$A$$ is Hurwitz. The virtual controls are mapped back to the original input $$u$$ using $$\eqref{fully-actuated}$$.
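
A sketch of computed-torque tracking on a 1-DOF pendulum (the model $$M = ml^2$$, $$b = mgl\sin q$$ and the gains are assumptions for illustration):

```python
import numpy as np

# 1-DOF pendulum: M qddot + b(q) = u with M = m l^2, b(q) = m g l sin(q)
m, l, g = 1.0, 1.0, 9.81
M = m * l**2
b = lambda q: m * g * l * np.sin(q)
kp, kd = 25.0, 10.0                       # both closed-loop poles at -5

qd = lambda t: np.sin(t)                  # desired trajectory
qd_dot, qd_ddot = np.cos, lambda t: -np.sin(t)

q, qdot, dt = 0.5, 0.0, 1e-3              # start off the desired path
for k in range(int(10.0 / dt)):
    t = k * dt
    v = qd_ddot(t) - kd * (qdot - qd_dot(t)) - kp * (q - qd(t))
    u = M * v + b(q)                      # virtual input mapped back to torque
    qddot = (u - b(q)) / M                # plant response (= v exactly)
    q, qdot = q + dt * qdot, qdot + dt * qddot

print(abs(q - qd(10.0)))                  # tracking error after 10 s: small
```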

In general, whenever $$\dim(Q) = \dim(U)$$ one can always use nonlinear static feedback to achieve linearization.

### partial feedback linearization

Consider an underactuated system $M(q) \ddot{q}+b(q, \dot{q})=\begin{pmatrix}0 \\ u\end{pmatrix}$

which can be equivalently expressed as $\begin{bmatrix}M_{11} & M_{12} \\M_{21} & M_{22}\end{bmatrix}\begin{bmatrix}\ddot{q}_1 \\\ddot{q}_2\end{bmatrix} +\begin{bmatrix}b_1 \\ b_2\end{bmatrix} =\begin{bmatrix}0 \\ u\end{bmatrix}$ Collocated input/output linearization: If the output is $$y=q_2\in\mathbb{R}^m$$, we can get $\bar{M}_{22}\ddot{q}_2 + \bar{b}_2 = u$ where $\bar{M}_{22} = M_{22} - M_{21}M_{11}^{-1}M_{12} \\\bar{b}_2 = b_2 - M_{21}M_{11}^{-1}b_1$ The rest is similar to the fully actuated control case.

Here $$q_1$$ are the remaining "non-linearized" coordinates. We can define $\eta = \begin{bmatrix}\eta_1 \\ \eta_2\end{bmatrix} =\begin{bmatrix}q_1 \\ \dot{q}_1\end{bmatrix}$ and we have the dynamics $\dot{\eta}_1 = \eta_2 \\\dot{\eta}_2 = -M_{11}^{-1}(M_{12}(\ddot{q}_d - k_p z_1 - k_d z_2) + b_1)$ and the complete system can be written as \begin{equation}\begin{aligned}&\dot{z} = Az &&\text{linearized} \\&\dot{\eta} = w(t, z,\eta) &&\text{non-linearized}\end{aligned}\label{collocated}\end{equation} The zero dynamics of the system is defined as the evolution of the non-linearized part after the linear part has stabilized, i.e. $\dot{\eta} = w(t, 0, \eta)$ Suppose $$w(t, 0,\eta_0) = 0$$ for $$t\ge 0$$, i.e. $$(0, \eta_0)$$ is an equilibrium of the full system $$\eqref{collocated}$$, and $$A$$ is Hurwitz. Then $$(0, \eta_0)$$ is locally stable if $$\eta_0$$ is locally stable for the zero dynamics (respectively, locally asymptotically stable, unstable).

Non-collocated input/output linearization: If the output is $$y = q_1\in \mathbb{R}^l$$, then it can only be linearized when $$l\le m$$ and $$\operatorname{rank}(M_{12}) = l$$. We can get $\tilde{M}_{21} \ddot{q}_1 + \tilde{b}_2 = u$ where $\tilde{M}_{21} = M_{21} - M_{22}M_{12}^\dagger M_{11} \\\tilde{b}_2 = b_2 - M_{22}M_{12}^\dagger b_1$ and $$M_{12}^\dagger = M_{12}^T(M_{12} M_{12}^T)^{-1}$$ is the right pseudo-inverse of $$M_{12}$$

The rest is similar.

## Dynamic feedback

If a control system is differentially flat then it is dynamic feedback linearizable on an open dense set, with the dynamic feedback possibly depending explicitly on time

## General case

### SISO system

Consider $\dot{x} = f(x) + g(x) u \\y = h(x)$ where $$u\in\mathbb{R}$$. Let $$x^*$$ be the equilibrium, then we have $\dot{y} = \nabla h^T [f(x) + g(x) u] = L_f h + u L_g h$

If $$|L_g h| \ne 0$$ $u = \frac{1}{L_g h}(-L_f h + v) \\\dot{y} = v$

More generally, we can keep differentiating $$\dot{y}, \ddot{y}, \dots$$ and denote $$\gamma$$ to be the smallest integer s.t. $L_gL_f^i h = 0 \qquad (i = 0, \dots, \gamma - 2) \\L_gL_f^{\gamma -1} h \ne 0$

Then we have $u = \frac{1}{L_gL_f^{\gamma-1}h}(-L_f^\gamma h + v) \\y^{(\gamma)} = v$ i.e. the output becomes a $$\gamma$$-th-order linear system. $$\gamma$$ is called the strict relative degree
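
A sketch on an assumed example with $$\gamma = 2$$: for $$\dot{x}_1 = x_2$$, $$\dot{x}_2 = -x_1^3 + u$$, $$y = x_1$$ we have $$L_g h = 0$$ and $$L_g L_f h = 1$$, so the control above reduces to $$u = x_1^3 + v$$.

```python
import numpy as np

kp, kd = 4.0, 4.0                  # v = -kp*y - kd*ydot: poles at -2, -2
x = np.array([1.0, 0.0])
dt = 1e-3
for _ in range(int(10.0 / dt)):    # forward-Euler simulation
    v = -kp * x[0] - kd * x[1]
    u = x[0]**3 + v                # u = (1/(Lg Lf h))(-Lf^2 h + v) cancels -x1^3
    x = x + dt * np.array([x[1], -x[0]**3 + u])

print(np.abs(x).max())             # near 0: the regulated output decays
```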

### MIMO system

Consider a two-input two-output system $\dot{x}=f(x)+g_{1}(x) u_{1}+g_{2}(x) u_{2} \\y=\begin{bmatrix}y_{1} \\y_{2}\end{bmatrix}=\begin{bmatrix}h_{1}(x) \\h_{2}(x)\end{bmatrix}$ Differentiate both outputs until the inputs appear: \begin{aligned}\begin{bmatrix}y_1^{(\gamma_1)} \\ y_2^{(\gamma_2)}\end{bmatrix} & =\begin{bmatrix}L_{g_{1}} L_{f}^{\gamma_{1}-1} h_{1} & L_{g_{2}} L_{f}^{\gamma_{1}-1} h_{1} \\L_{g_{1}} L_{f}^{\gamma_{2}-1} h_{2} & L_{g_{2}} L_{f}^{\gamma_{2}-1} h_{2}\end{bmatrix}\begin{bmatrix}u_1 \\ u_2\end{bmatrix} +\begin{bmatrix}L_f^{\gamma_1}h_1 \\ L_f^{\gamma_2}h_2\end{bmatrix} \\&\triangleq G(x) u + H(x)\end{aligned} The system has a relative degree $$(\gamma_1, \gamma_2)$$ at $$x^*$$ if $L_{g_{j}} L_{f}^{k} h_{i}(x)=0, \quad j=1,2, \quad 0 \leq k \leq \gamma_{i}-2, \quad i=1,2$ and $$G$$ is non-singular. Then $u = G^{-1}(x) [v - H(x)] \\\begin{bmatrix}y_1^{(\gamma_1)} \\ y_2^{(\gamma_2)}\end{bmatrix} =\begin{bmatrix}v_1 \\ v_2\end{bmatrix}$

### normal forms

If a SISO system has a relative degree $$\gamma \le n$$ at some point $$x^*$$ then it can be transformed into a normal form, i.e. one can find a change of coordinates $$x\to\Phi(x)$$ s.t. $\Phi(x) = \begin{bmatrix}z \\ \eta\end{bmatrix} \triangleq\begin{bmatrix}h(x) \\ L_f h \\ \vdots \\ L_f^{\gamma-1}h \\\eta_1(x) \\ \vdots \\ \eta_{n-\gamma}(x)\end{bmatrix}$ The last $$n-\gamma$$ coordinates $$\eta$$ are chosen so that the following conditions hold:

1. $$\Phi(x)$$ is a diffeomorphism, i.e. a smooth map with smooth inverse. Equivalent to $$\partial \Phi$$ has full rank
2. The dynamics of $$\dot{\eta}$$ is not directly affected by $$u$$, i.e. $$L_g\eta_i = 0$$. It means that $$\eta$$ are the internal dynamics $$\dot{\eta} = w(t,z,\eta)$$

So we have two parts of the dynamics: $$z$$-dynamics and internal dynamics $\dot{z}=A z+B v \\\dot{\eta}=w(t, z, \eta)$ where $A = \begin{bmatrix}0 & I_{\gamma-1} \\0 & 0\end{bmatrix} \qquad B = \begin{bmatrix}0 \\ \vdots \\ 0 \\ 1\end{bmatrix}$ The $$z$$-dynamics is controllable. The virtual input can be chosen as $$v=Kz$$ s.t. $$A+BK$$ is Hurwitz, and the control is $u = \frac{1}{L_gL_f^{\gamma-1}h}(-L_f^\gamma h + v)$ If the zero dynamics $\dot{\eta} = w(t, 0, \eta)$ is A.S. then the system is minimum phase, otherwise it is non-minimum phase
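
For $$\gamma = 2$$ the $$z$$-dynamics is a double integrator, and a stabilizing $$K$$ can be checked directly (gains chosen for illustration):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
# Target polynomial (s + 3)^2 = s^2 + 6 s + 9  =>  v = -9 z1 - 6 z2
K = np.array([[-9.0, -6.0]])

eigs = np.linalg.eigvals(A + B @ K)
print(eigs)  # both eigenvalues at -3: A + BK is Hurwitz
```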

# Backstepping

Backstepping is a nonlinear control design tool for underactuated systems.

## Integrator Backstepping

Consider \begin{align}\dot{\eta} &= f(\eta) + g(\eta)\xi \label{integrator} \\\dot{\xi} &= u \nonumber\end{align} where $$[\eta^T, \xi]\in\mathbb{R}^{n+1}$$ and $$u\in\mathbb{R}$$ is the control input. The functions $$f,g$$ are smooth in the domain that contains $$0$$ and $$f(0)=0$$. The goal is to design a controller which stabilizes the origin $$(\eta, \xi)=(0,0)$$

Assume there is a control law $$\xi=\phi(\eta)$$ which makes the subsystem $$\eqref{integrator}$$ A.S. with $$\phi(0)=0$$ and the Lyapunov function $$V_0(\eta)$$ s.t. $\frac{\partial V_{0}}{\partial \eta}[f(\eta)+g(\eta) \phi(\eta)] \leq-W(\eta) \qquad (\forall \eta \in D)$ where $$W(\eta)$$ is P.D. Now using

$V(\eta, \xi) = V_0(\eta) + \frac{1}{2} \|\xi-\phi(\eta)\|^2$ and we can obtain \begin{aligned}\dot{V} &=\frac{\partial V_{0}}{\partial \eta}(f+g\xi) + (\xi-\phi) (u - \frac{\partial \phi}{\partial \eta} \dot{\eta}) \\& \le -W(\eta) + (\xi-\phi) (u - \frac{\partial \phi}{\partial \eta} \dot{\eta} + \frac{\partial V_{0}}{\partial \eta} g)\end{aligned}

If we set $u=\frac{\partial \phi}{\partial \eta}(f+g\xi)-\frac{\partial V_{0}}{\partial \eta} g-k(\xi-\phi) \qquad (k> 0)$ then $$\dot{V} \le -W(\eta) - k\|\xi-\phi\|^2 < 0$$, so the origin $$(\eta, \xi)=(0,0)$$ is A.S. (the condition $$\phi(0)=0$$ guarantees it is an equilibrium of the full system)
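
The recipe can be checked on a scalar example (assumed, not from the notes): $$\dot{\eta} = \eta^2 + \xi$$, $$\dot{\xi} = u$$, with $$\phi(\eta) = -\eta^2 - \eta$$, $$V_0 = \eta^2/2$$ and $$k = 1$$.

```python
import numpy as np

# eta' = eta^2 + xi, xi' = u; f(eta) = eta^2, g = 1
# Step 1: phi(eta) = -eta^2 - eta gives eta' = -eta, with V0 = eta^2/2
def u(eta, xi):
    phi = -eta**2 - eta
    dphi = -2.0 * eta - 1.0                     # dphi/deta
    # u = dphi*(f + g*xi) - dV0/deta * g - k*(xi - phi), with k = 1
    return dphi * (eta**2 + xi) - eta - (xi - phi)

eta, xi, dt = 1.0, 0.0, 1e-3
for _ in range(int(20.0 / dt)):                 # forward-Euler simulation
    eta, xi = eta + dt * (eta**2 + xi), xi + dt * u(eta, xi)

print(abs(eta) + abs(xi))  # near 0: the origin is asymptotically stable
```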

## Block backstepping

Consider \begin{align}\dot{\eta}&=f(\eta)+G(\eta) \xi \label{block} \\\dot{\xi}&=f_{a}(\eta, \xi)+G_{a}(\eta, \xi) u \nonumber\end{align} where $$\eta\in\mathbb{R}^n$$, $$\xi\in\mathbb{R}^m$$ and $$u\in\mathbb{R}^m$$ is the control input. The functions $$f, f_a, G, G_a$$ are smooth in the domain of interest, $$f(0) = f_a(0) = 0$$ and $$G_a$$ is a non-singular $$m\times m$$ matrix.

Assume there is a control law $$\xi=\phi(\eta)$$ which makes the subsystem $$\eqref{block}$$ A.S. with $$\phi(0)=0$$ and the Lyapunov function $$V_0$$ s.t. $\frac{\partial V_{0}}{\partial \eta}[f(\eta)+G(\eta) \phi(\eta)] \le -W(\eta) \qquad \forall \eta \in D$ where $$W(\eta)$$ is P.D. Now using $V(\eta, \xi) = V_0(\eta) + \frac{1}{2} \|\xi-\phi(\eta)\|^2$ and we can obtain \begin{aligned}\dot{V}&=\frac{\partial V_{0}}{\partial \eta}(f+G \xi)+[\xi-\phi]^{T}\left[f_{a}+G_{a} u-\frac{\partial \phi}{\partial \eta}\dot{\eta}\right] \\&\le -W(\eta) + [\xi-\phi]^{T}\left[f_{a}+G_{a} u-\frac{\partial \phi}{\partial \eta}\dot{\eta} + \left(\frac{\partial V_{0}}{\partial \eta} G\right)^T \right] \end{aligned} If we set $u=G_{a}^{-1}\left[\frac{\partial \phi}{\partial \eta}(f+G \xi) - \left(\frac{\partial V_{0}}{\partial \eta} G\right)^{T} - f_a - k(\xi-\phi)\right] \qquad (k>0)$ then $$\dot{V} < 0$$. So that the origin $$(\eta, \xi)=(0,0)$$ is A.S.

# Lyapunov Redesign and Robust Backstepping

## Uncertainty and Lyapunov redesign

Consider $\begin{equation}\dot{x} = f(t,x) + G(t,x)[u + \delta(t,x,u)]\label{uncertain}\end{equation}$ where $$x\in\mathbb{R}^n$$ is the state and $$u\in\mathbb{R}^p$$ is the control. The functions $$f,G,\delta$$ are defined for $$(x,u)\in D\times \mathbb{R}^p$$ ($$D$$ contains the origin), piecewise continuous and Lipschitz in $$x$$ and $$u$$. Assume $$f,G$$ are known while $$\delta$$ is unknown.

When the uncertainty acts only along the control vector fields (the columns of the matrix $$G$$) it is said to satisfy the matching condition, i.e. it matches the controls. $$\eqref{uncertain}$$ is in such form. Stabilizing controls for this case can be designed by Lyapunov redesign. In the non-matching case, it is necessary to impose more restrictive assumptions on the bounds of $$\delta$$ and employ recursive techniques such as robust backstepping.

Assume that a feedback controller $$u=\psi(t,x)$$ was designed so that the nominal system $\dot{x} = f(t,x) + G(t,x)u$ is A.S. and that the Lyapunov function $$V(t,x)$$ satisfies $\alpha_{1}(\|x\|) \leq V(t, x) \le \alpha_{2}(\|x\|) \\\partial_{t} V+\partial_{x} V \cdot[f(t, x)+G(t, x) \psi(t, x)] \le -\alpha_{3}(\|x\|)$ for all $$x\in D$$, where the $$\alpha_i$$ are strictly increasing and $$\alpha_i(0)=0$$

Assume the uncertainty satisfies the bound $\begin{equation}\|\delta(t, x, \psi(t, x)+v)\| \leq \rho(t, x)+k_{0}\|v\| \qquad (0 \leq k_{0}<1)\label{bound}\end{equation}$ where $$\rho:[0,t_f]\times D\to \mathbb{R}$$ is a non-negative continuous function and specifies the magnitude of the uncertainty. The idea behind Lyapunov redesign is to augment the nominal control law $$\psi(t,x)$$ with an extra term $$v\in\mathbb{R}^p$$ which suppresses the uncertainty so that the combined control $$u=\psi(t,x)+v$$ stabilizes the real system.

Now $$\dot{V}$$ becomes \begin{aligned}\dot{V}&=\partial_{t} V+\partial_{x} V \cdot[f+G \psi]+\partial_{x} V \cdot G[v+\delta] \\&\le -\alpha_{3}(\|x\|)+\partial_{x} V \cdot G[v+\delta]\end{aligned} Setting $$w^T = \partial_x V\cdot G$$ it becomes $\dot{V}\le -\alpha_3(\|x\|) + w^T v + w^T \delta$ Using the bound $$\eqref{bound}$$ we have $w^{T} v+w^{T} \delta \leq w^{T} v+\|w\| (\rho+k_{0}\|v\|)$ Setting $v = -\eta(t,x)\frac{w}{\|w\|} \\\eta(t,x) \ge \frac{\rho(t,x)}{1-k_0}$ we have \begin{aligned}w^{T} v+w^{T} \delta &\leq-\eta(t,x)\|w\|+\|w\|\left(\rho+k_{0} \eta(t,x)\right) \\&=\|w\|\left(\rho-\eta\left(1-k_{0}\right)\right) \\&\leq 0\end{aligned} So $$\dot{V} \le -\alpha_3(\|x\|)$$ for the whole system.

The controller is discontinuous at $$w=0$$, e.g. typically at the origin. In addition to this theoretical limitation, practical issues also occur due to digital switching, delays, and other physical imperfections. This results in oscillatory behavior near the equilibrium called chattering. The solution is to smooth the control law near the origin: \begin{aligned}v&=-\eta(t, x) \frac{w}{\|w\|} && \text { if } \eta(t, x)\|w\| \geq \epsilon \\v&=-\eta(t, x)^{2} \frac{w}{\epsilon} && \text { if } \eta(t, x)\|w\|<\epsilon\end{aligned}
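
The smoothed control term as a function (a sketch; here $$\eta$$ is the scalar gain bounding the uncertainty, and the values of $$w$$, $$\eta$$, $$\epsilon$$ are made up):

```python
import numpy as np

def v(w, eta, eps=0.1):
    nw = np.linalg.norm(w)
    if eta * nw >= eps:
        return -eta * w / nw        # unit-vector term of magnitude eta
    return -(eta**2) * w / eps      # linear in w near w = 0: no chattering

w = np.array([1e-3, 0.0])
print(v(w, eta=2.0))  # magnitude 0.04 instead of the full gain 2.0
```

The two branches agree on the switching surface $$\eta\|w\| = \epsilon$$, so the resulting control is continuous.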

## Robust backstepping

Consider the single-input system \begin{align}\dot{\eta}&=f(\eta)+g(\eta) \xi+\delta_{\eta}(\eta, \xi) \label{robust-bs} \\\dot{\xi}&=f_{a}(\eta, \xi)+g_{a}(\eta, \xi) u+\delta_{\xi}(\eta, \xi) \nonumber\end{align} where $$\eta\in\mathbb{R}^n$$, $$\xi\in\mathbb{R}$$ are defined over a domain $$D$$ that contains the origin. Assume $$f,g,f_a,g_a$$ are smooth and known, $$f(0)=f_a(0)=0$$. And $$\delta_\eta, \delta_\xi$$ are uncertain terms that satisfy \begin{equation}\begin{aligned}\left\|\delta_{\eta}(\eta, \xi)\right\|_{2} &\leq a_{1}\|\eta\|_{2} \\\left|\delta_{\xi}(\eta, \xi)\right| &\leq a_{2}\|\eta\|_{2}+a_{3}|\xi|\end{aligned}\label{assume-1}\end{equation} for all $$(\eta, \xi)$$

Assume we have a stabilizing controller $$\xi=\phi(\eta)$$ for the subsystem $$\eqref{robust-bs}$$ and a Lyapunov function $$V_0(\eta)$$ s.t. $\frac{\partial V_{0}}{\partial \eta}\left[f(\eta)+g(\eta) \phi(\eta)+\delta_{\eta}(\eta, \xi)\right] \leq-b\|\eta\|^{2}$ for some $$b>0$$. Suppose further that $$\phi(\eta)$$ satisfies $\begin{equation}|\phi(\eta)| \leq a_{4}\|\eta\|, \quad\left\|\frac{\partial \phi}{\partial \eta}\right\| \leq a_{5}\label{assume-2}\end{equation}$ Consider a Lyapunov function $V(\eta, \xi) = V_0(\eta) +\frac{1}{2}[\xi - \phi(\eta)]^2$ we have $\dot{V}=\frac{\partial V_{0}}{\partial \eta}\left[f+g \phi+\delta_{\eta}\right]+\frac{\partial V_{0}}{\partial \eta} g(\xi-\phi)+(\xi-\phi)\left[f_{a}+g_{a} u+\delta_{\xi}-\frac{\partial \phi}{\partial \eta}\left(f+g \xi+\delta_{\eta}\right)\right]$ Taking $u=\frac{1}{g_{a}}\left[\frac{\partial \phi}{\partial \eta}(f+g \xi)-\frac{\partial V_{0}}{\partial \eta} g-f_{a}-k(\xi-\phi)\right] \qquad (k>0)$ so we have $\dot{V} \leq-b\|\eta\|^{2}+(\xi-\phi)\left[\delta_{\xi}-\frac{\partial \phi}{\partial \eta} \delta_{\eta}\right]-k(\xi-\phi)^{2}$ Using assumptions $$\eqref{assume-1}$$, $$\eqref{assume-2}$$ then \begin{aligned}\dot{V} & \leq-b\|\eta\|^{2}+2 a_{6}|\xi-\phi|\|\eta\|-\left(k-a_{3}\right)|\xi-\phi|^{2} \\&=-\begin{bmatrix}\|\eta\| \\|\xi-\phi|\end{bmatrix}^T\begin{bmatrix}b & -a_{6} \\-a_{6} & (k-a_{3})\end{bmatrix}\begin{bmatrix}\|\eta\| \\|\xi-\phi|\end{bmatrix}\end{aligned} for some $$a_6 > 0$$. Taking $$k\ge a_3 + a_6^2/b$$ yields $$\dot{V} \le 0$$

[EN530.678](https://asco.lcsr.jhu.edu/en530-678-s2021-nonlinear-control-and-planning-in-robotics/) Review.
Nonlinear Optimization (https://silencial.github.io/nonlinear-optimization/, 2021-05-13). EN553.762 Review. Based on the Numerical Optimization book.

# Constrained Optimization

$\begin{equation}\min_{x \in \mathbb{R}^n} f(x) \qquad \text{s.t. }\begin{cases}c_i(x) = 0, \quad i \in \mathcal{E} \\c_i(x) \ge 0, \quad i \in \mathcal{I}\end{cases}\label{con-optim}\end{equation}$

where $$f$$ and $$c_i$$ are all smooth, real-valued functions, and $$\mathcal{E}$$ and $$\mathcal{I}$$ are two finite sets of indices.

Definition: feasible set $$\Omega$$ $\Omega=\left\{x \mid c_{i}(x)=0, \quad i \in \mathcal{E} ; \quad c_{i}(x) \geq 0, \quad i \in \mathcal{I}\right\}$ Definition: Given $$x\in\Omega$$, the active set $$\mathcal{A}(x)$$ is $\mathcal{A}(x)=\mathcal{E} \cup\left\{i \in \mathcal{I} \mid c_{i}(x)=0\right\}$

## Tangent Cone and Constraint Qualifications

Definition: The vector $$d$$ is said to be a tangent vector to $$\Omega$$ at a point $$x$$ if there are a feasible sequence $$\{z_k\}$$ approaching $$x$$ and a sequence of positive scalars $$\{t_k\}$$ approaching $$0$$ s.t. $\lim _{k \rightarrow \infty} \frac{z_{k}-x}{t_{k}}=d$ The set of all tangents to $$\Omega$$ at $$x^*$$ is called the tangent cone and is denoted by $$T_\Omega(x^*)$$

Definition: Given $$x\in\Omega$$ and $$\mathcal{A}(x)$$, the set of linearized feasible directions $$\mathcal{F}(x)$$ is $\mathcal{F}(x)=\left\{d \mid \begin{array}{l}d^{T} \nabla c_{i}(x)=0, & \forall i \in \mathcal{E} \\d^{T} \nabla c_{i}(x) \geq 0, & \forall i \in \mathcal{A}(x) \cap \mathcal{I}\end{array}\right\}$ Definition: Given $$x$$ and $$\mathcal{A}(x)$$, the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients $$\{\nabla c_i(x), i\in\mathcal{A}(x)\}$$ is linearly independent.

## First-Order Optimality Conditions

Define the Lagrangian function $\mathcal{L}(x, \lambda)=f(x)-\sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_{i} c_{i}(x)$ and the first-order necessary conditions (a.k.a. the KKT conditions) are \begin{aligned}\nabla_{x} \mathcal{L}\left(x^{*}, \lambda^{*}\right) &=0 \\c_{i}\left(x^{*}\right) &=0, \quad \forall i \in \mathcal{E} \\c_{i}\left(x^{*}\right) & \geq 0, \quad \forall i \in \mathcal{I} \\\lambda_{i}^{*} & \geq 0, \quad \forall i \in \mathcal{I} \\\lambda_{i}^{*} c_{i}\left(x^{*}\right) &=0, \quad \forall i \in \mathcal{E} \cup \mathcal{I}\end{aligned}

### Proof

Lemma: Let $$x^*$$ be a feasible point. The following two statements are true

1. $$T_{\Omega}\left(x^{*}\right) \subset \mathcal{F}\left(x^{*}\right)$$
2. If the LICQ conditions is satisfied at $$x^*$$, then $$\mathcal{F}\left(x^{*}\right) = T_{\Omega}\left(x^{*}\right)$$

Farkas' lemma: Consider the cone $$K=\{By + Cw \mid y\ge 0\}$$, given any $$g\in \mathbb{R}^n$$, we have either

1. $$g\in K$$
2. $$\exists d \in \mathbb{R}^n$$ s.t. $$g^T d < 0, \quad B^T d\ge 0, \quad C^Td = 0$$

## Second-Order Conditions

Definition: the critical cone $$\mathcal{C}(x^*, \lambda^*)$$ is $\mathcal{C}\left(x^{*}, \lambda^{*}\right)=\left\{w \in \mathcal{F}\left(x^{*}\right) \mid \nabla c_{i}\left(x^{*}\right)^{T} w=0, \text { all } i \in \mathcal{A}\left(x^{*}\right) \cap \mathcal{I} \text { with } \lambda_{i}^{*}>0\right\}$ Second-order necessary conditions: if $$x^*$$ is a local solution and the LICQ condition is satisfied, then $w^{T} \nabla_{x x}^{2} \mathcal{L}\left(x^{*}, \lambda^{*}\right) w \geq 0, \quad \forall w \in \mathcal{C}\left(x^{*}, \lambda^{*}\right)$ Second-order sufficient conditions: $w^{T} \nabla_{x x}^{2} \mathcal{L}\left(x^{*}, \lambda^{*}\right) w > 0, \quad \forall w \in \mathcal{C}\left(x^{*}, \lambda^{*}\right), w\ne 0$

## Geometric Viewpoint

Definition: The normal cone to the set $$\Omega$$ at the point $$x\in \Omega$$ is $N_{\Omega}(x)=\left\{v \mid v^{T} w \leq 0, \ \forall w \in T_{\Omega}(x)\right\}$ Theorem: Suppose $$x^*$$ is a local minimizer of the following problem: $\min f(x) \qquad \text{s.t. } x\in\Omega$ then $$-\nabla f(x^*) \in N_\Omega(x^*)$$

Lemma: Suppose the LICQ holds at $$x^*$$, then $$N_\Omega(x^*) = -N$$, where $$N$$ is defined as $N=\left\{\sum_{i \in \mathcal{A}\left(x^{*}\right)} \lambda_{i} \nabla c_{i}\left(x^{*}\right), \quad \lambda_{i} \geq 0 \text { for } i \in \mathcal{A}\left(x^{*}\right) \cap \mathcal{I}\right\}$

## Duality

Consider the problem with only inequality constraints: $\min_{x\in\mathbb{R}^n} f(x) \qquad \text{s.t. } c(x) \ge 0 \\c(x) \triangleq (c_1(x), \dots, c_m(x))^T$ Definition: The dual problem to the above primal problem is $\max_{\lambda\in\mathbb{R}^m} q(\lambda) \qquad \text{s.t. } \lambda \ge 0 \\q(\lambda) \triangleq \inf_x \mathcal{L}(x, \lambda)$ Theorem: $$q$$ is concave and its domain $$D=\{\lambda \mid q(\lambda) > -\infty \}$$ is convex

Theorem: (Weak Duality) For any feasible $$x$$ of the primal problem and any $$\lambda \ge 0$$, we have $$q(\lambda) \le f(x)$$

Theorem: Suppose $$x$$ is a solution of the primal problem and that $$f$$ and $$-c_i$$ are convex and differentiable. Then any $$\lambda$$ for which $$(x,\lambda)$$ satisfies the KKT conditions is a solution of the dual problem

### Linear Programming

Linear programming and its dual: \begin{aligned}&\min_{x} c^T x &&\text{s.t. } Ax - b \ge 0 \\&\max_{\lambda} b^T \lambda &&\text{s.t. } A^T\lambda = c,\ \lambda\ge 0\end{aligned} Another form: \begin{aligned}&\max_{x} c^T x &&\text{s.t. } A x - b\le 0, \ x\ge 0 \\&\min_{\lambda} b^T\lambda &&\text{s.t. } A^T \lambda - c\ge 0, \ \lambda\ge 0 \\\end{aligned} Theorem:

1. If either primal or dual has a (finite) solution, then so does the other, and the objective values are equal
2. If either primal or dual is unbounded, then the other problem is infeasible

### Quadratic Programming

Primal problem: \begin{aligned}&\min_x &&\frac{1}{2}x^TGx + c^Tx \\&\text{ s.t. } &&Ax-b\ge 0\end{aligned} Dual problem: \begin{aligned}&\max _{(\lambda, x)} &&\frac{1}{2} x^{T} G x+c^{T} x-\lambda^{T}(A x-b) \\&\text { s.t. } &&G x+c-A^{T} \lambda=0, \quad \lambda \geq 0\end{aligned}

# Penalty and Augmented Lagrangian Methods

## Quadratic Penalty Method

Consider an optimization problem with only equality constraints $\begin{equation}\min_x f(x) \quad \text{s.t. } c_i(x) = 0, \quad i\in \mathcal{E}\label{equ-con}\end{equation}$ The quadratic penalty function $$Q(x; \mu)$$ for this formulation is $\begin{equation}Q(x; \mu) = f(x) + \frac{\mu}{2}\sum_{i\in\mathcal{E}} c_i^2(x)\label{qua}\end{equation}$ Algorithm:

1. Choose $$\mu_0$$, starting point $$x_0^s$$ and a nonnegative sequence $$\{\tau_k\}$$ with $$\tau_k \to 0$$
2. For $$k=0,1,2,\dots$$
    1. Find $$x_k \leftarrow \min Q(\cdot;\mu_k)$$, starting at $$x_k^s$$ and terminating when $$\|\nabla_x Q(x_k;\mu_k)\| \le \tau_k$$
    2. Return $$x_k$$ if the final convergence test is satisfied
    3. Otherwise choose a new $$\mu_{k+1} > \mu_k$$, e.g. $$\mu_{k+1} = 2\mu_k$$, and a new starting point $$x_{k+1}^s = x_k$$
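As a concrete illustration, here is a minimal pure-Python sketch of the framework above on a toy problem of my own choosing, $\min x_1 + x_2$ s.t. $x_1^2 + x_2^2 - 2 = 0$, whose solution is $x^* = (-1,-1)$; the inner solver is a simple gradient descent with backtracking, chosen only to keep the sketch self-contained:

```python
import math

def grad_descent(f, grad, x, tol, max_iter=5000):
    """Minimize f by gradient descent with backtracking line search."""
    for _ in range(max_iter):
        g = grad(x)
        gnorm = math.sqrt(sum(gi * gi for gi in g))
        if gnorm <= tol:
            break
        a, fx = 1.0, f(x)
        for _ in range(60):  # backtrack until the Armijo sufficient-decrease test holds
            trial = [xi - a * gi for xi, gi in zip(x, g)]
            if f(trial) <= fx - 1e-4 * a * gnorm * gnorm:
                break
            a *= 0.5
        x = trial
    return x

# Toy problem: min x1 + x2  s.t.  c(x) = x1^2 + x2^2 - 2 = 0  (solution x* = (-1, -1))
def solve_penalty():
    x, mu = [-1.5, -0.5], 1.0
    for _ in range(10):                       # mu_k doubles each outer iteration, tau_k = 1/mu_k
        def f(x):
            c = x[0]**2 + x[1]**2 - 2
            return x[0] + x[1] + 0.5 * mu * c * c      # Q(x; mu)
        def grad(x):
            c = x[0]**2 + x[1]**2 - 2
            return [1 + 2 * mu * c * x[0], 1 + 2 * mu * c * x[1]]
        x = grad_descent(f, grad, x, tol=1.0 / mu)    # warm start from previous x_k
        mu *= 2.0
    return x

x_sol = solve_penalty()
```

Note how each subproblem is only solved to the loose tolerance $\tau_k = 1/\mu_k$; the warm start makes the increasingly ill-conditioned subproblems cheap.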

Theorem: Suppose $$x_k$$ is the exact global minimizer of $$Q(x;\mu_k)$$ defined by $$\eqref{qua}$$, and that $$\mu_k \to \infty$$. Then every limit point $$x^*$$ of the sequence $$\{x_k\}$$ is a global solution of problem $$\eqref{equ-con}$$

Theorem: Suppose in the general algorithmic framework we have $$\tau_k\to 0$$ and $$\mu_k\to \infty$$. If $$x^*$$ is a limit point of $$\{x_k\}$$ then

1. If $$x^*$$ is infeasible, then $$x^*$$ is a stationary point of $$\|c(x)\|^2$$
2. If $$x^*$$ is feasible and $$\nabla c_i(x^*)$$ are linearly independent, then $$x^*$$ is a KKT point for $$\eqref{equ-con}$$ with some $$\lambda^*$$ s.t. $$\lim_{k\to\infty} -\mu_k c_i(x_k) = \lambda_i^*$$

## Nonsmooth Penalty Function

For inequality constraints, we can use the $$\ell_1$$ penalty function defined by $\phi_{1}(x ; \mu)=f(x)+\mu \sum_{i \in \mathcal{E}}\left|c_{i}(x)\right|+\mu \sum_{i \in \mathcal{I}}\left[c_{i}(x)\right]^{-}$ where $$[y]^- \triangleq \max(0, -y)$$. Theorem: Suppose $$x^*$$ is a strict local solution of $$\eqref{con-optim}$$ at which the KKT conditions are satisfied, with Lagrange multipliers $$\lambda_i^*$$. Then $$x^*$$ is a local minimizer of $$\phi_1(x;\mu)$$ for all $$\mu > \mu^*$$, where $\mu^* = \|\lambda^*\|_\infty = \max_{i\in \mathcal{E} \cup \mathcal{I}} |\lambda_i^*|$ If, in addition, the second-order sufficient conditions hold and $$\mu>\mu^*$$, then $$x^*$$ is a strict local minimizer of $$\phi_1(x;\mu)$$

## Augmented Lagrangian Method

In the quadratic penalty method, the approximate minimizer $$x_k$$ of $$Q(x;\mu_k)$$ does not exactly satisfy the feasibility conditions $$c_i(x) = 0$$; instead, the constraints are perturbed so that $c_i(x_k) \approx -\lambda_i^*/\mu_k, \quad \forall i \in \mathcal{E}$ To make the minimizer more nearly satisfy the equality constraints, we define the augmented Lagrangian function, a combination of the Lagrangian and the quadratic penalty function $\mathcal{L}_{A}(x, \lambda ; \mu) \triangleq f(x)-\sum_{i \in \mathcal{E}} \lambda_{i} c_{i}(x)+\frac{\mu}{2} \sum_{i \in \mathcal{E}} c_{i}^{2}(x)$ During the $$k$$-th iteration, $$\lambda^k$$ and $$\mu_k$$ are fixed and the minimization is performed only over $$x$$ $\nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k) = \nabla f(x_k) - \sum_{i\in\mathcal{E}}[\lambda_i^k - \mu_k c_i(x_k)] \nabla c_i(x_k)$ After finding the approximate minimizer $$x_k$$ we can update $$\lambda_i^{k+1} = \lambda_i^k - \mu_k c_i(x_k)$$

Algorithm:

1. Choose $$\mu_0 > 0$$, tolerance $$\tau_0 > 0$$, starting point $$x_0^s$$ and $$\lambda^0$$
2. For $$k=0,1,2,\dots$$
    1. Find $$x_k \leftarrow \min \mathcal{L}_A(\cdot, \lambda^k;\mu_k)$$, starting at $$x_k^s$$ and terminating when $$\|\nabla_x \mathcal{L}_A(x_k, \lambda^k;\mu_k)\| \le \tau_k$$
    2. Return $$x_k$$ if the final convergence test is satisfied
    3. Otherwise update the Lagrange multipliers $$\lambda^{k+1}=\lambda^k - \mu_k c(x_k)$$, choose a new penalty parameter $$\mu_{k+1} \ge \mu_k$$, a new tolerance $$\tau_{k+1}$$, and a new starting point $$x_{k+1}^s = x_k$$
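A minimal self-contained sketch of this algorithm on an illustrative problem of my own, $\min x_1 + x_2$ s.t. $x_1^2 + x_2^2 - 2 = 0$ (solution $x^* = (-1,-1)$, multiplier $\lambda^* = -1/2$); unlike the quadratic penalty method, the penalty $\mu$ can stay moderate while the multiplier update drives the iterates to feasibility:

```python
import math

def grad_descent(f, grad, x, tol, max_iter=5000):
    """Minimize f by gradient descent with backtracking line search."""
    for _ in range(max_iter):
        g = grad(x)
        gnorm = math.sqrt(sum(gi * gi for gi in g))
        if gnorm <= tol:
            break
        a, fx = 1.0, f(x)
        for _ in range(60):
            trial = [xi - a * gi for xi, gi in zip(x, g)]
            if f(trial) <= fx - 1e-4 * a * gnorm * gnorm:
                break
            a *= 0.5
        x = trial
    return x

# Toy problem: min x1 + x2  s.t.  c(x) = x1^2 + x2^2 - 2 = 0
# (solution x* = (-1, -1), lambda* = -1/2)
def solve_auglag(mu=10.0, outer=10):
    x, lam = [-1.5, -0.5], 0.0
    for _ in range(outer):
        def f(x):
            c = x[0]**2 + x[1]**2 - 2
            return x[0] + x[1] - lam * c + 0.5 * mu * c * c   # L_A(x, lam; mu)
        def grad(x):
            c = x[0]**2 + x[1]**2 - 2
            s = 2 * (mu * c - lam)   # d/dx of (-lam*c + mu/2 c^2) along each coordinate
            return [1 + s * x[0], 1 + s * x[1]]
        x = grad_descent(f, grad, x, tol=1e-6)
        lam = lam - mu * (x[0]**2 + x[1]**2 - 2)              # multiplier update
    return x, lam

x_sol, lam_sol = solve_auglag()
```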

Theorem: Suppose $$x^*$$ is a local solution of $$\eqref{equ-con}$$ at which the LICQ is satisfied, and the second-order sufficient conditions are satisfied for $$\lambda^*$$. Then there is a threshold value $$\bar{\mu}$$ s.t. for all $$\mu\ge \bar{\mu}$$, $$x^*$$ is a strict local minimizer of $$\mathcal{L}_A(x, \lambda^*; \mu)$$

## Practical Augmented Lagrangian Method

Given inequality constraints, we can convert them into equality constraints and bound constraints by introducing slack variables $$s_i$$ $c_i(x) - s_i = 0, \quad s_i \ge 0, \quad \forall i\in\mathcal{I}$ By incorporating $$s_i$$ into $$x$$ and redefining $$c_i$$ accordingly, $$\eqref{con-optim}$$ can be written as $\min_{x \in \mathbb{R}^n} f(x) \qquad \text{s.t. }\begin{cases}c_i(x) = 0 \\l\le x\le u\end{cases}$ And the subproblem during each iteration becomes $\begin{equation}\min_x \mathcal{L}_A(x,\lambda;\mu)\qquad \text{s.t. } l\le x\le u\label{sub-lag}\end{equation}$ An efficient technique for solving the nonlinear program with bound constraints is the gradient projection method. The KKT conditions for $$\eqref{sub-lag}$$ are $x-P\left(x-\nabla_{x} \mathcal{L}_{A}(x, \lambda ; \mu), l, u\right)=0$ where $$P(g,l,u)$$ is the projection of the vector $$g$$ onto the rectangular box $$[l, u]$$ defined as $P(g, l, u)_{i}=\begin{cases}l_{i} & \text { if } g_{i} \le l_{i} \\g_{i} & \text { if } g_{i} \in\left(l_{i}, u_{i}\right) \\u_{i} & \text { if } g_{i} \ge u_{i}\end{cases}$
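The box projection $P(g, l, u)$ is just a componentwise clamp; a minimal sketch (the example vectors are made up):

```python
def project_box(g, l, u):
    """Componentwise projection of g onto the rectangular box [l, u]."""
    return [min(max(gi, li), ui) for gi, li, ui in zip(g, l, u)]

# Components below l are clamped to l, above u to u, the rest pass through
p = project_box([-2.0, 0.5, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```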

# Quadratic Programming

\begin{equation}\begin{aligned}&\min_{x} && \frac{1}{2} x^{T} G x+x^{T} c \\&\text{ s.t. } && a_{i}^{T} x=b_{i}, \quad i \in \mathcal{E} \\& && a_{i}^{T} x \geq b_{i}, \quad i \in \mathcal{I}\end{aligned}\label{qp}\end{equation}

If $$G\in\mathbb{S}_+$$, then it is a convex QP, and the difficulty in solving this problem is similar to a linear program.

## Equality Constraints

\begin{aligned}&\min_{x} && \frac{1}{2} x^{T} G x+x^{T} c \\&\text{ s.t. } && Ax = b\end{aligned}

where $$A$$ is the $$m\times n$$ Jacobian of the constraints ($$m\le n$$). Assume $$A$$ has full row rank.

The KKT conditions are $\begin{equation}\begin{bmatrix}G & -A^T \\A & 0\end{bmatrix}\begin{bmatrix}x^* \\ \lambda^*\end{bmatrix} =\begin{bmatrix}-c \\ b\end{bmatrix}\label{qp-kkt}\end{equation}$ By expressing $$x^* = x + p$$ where $$x$$ is some estimate of the solution, then the KKT conditions become $\begin{bmatrix}G & A^T \\A & 0\end{bmatrix}\begin{bmatrix}-p \\ \lambda^*\end{bmatrix} =\begin{bmatrix}c + Gx \\ Ax - b\end{bmatrix}$ The matrix in this equation is called the KKT matrix.

Theorem: Let $$Z$$ be the $$n\times(n-m)$$ matrix whose columns are a basis for the null space of $$A$$. If $$A$$ has full row rank and the reduced-Hessian matrix $$Z^TGZ$$ is P.D., then the KKT matrix is nonsingular. Further, the solution of $$\eqref{qp-kkt}$$ is the unique global solution of the equality-constrained QP
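For instance (illustrative data of my own, assuming NumPy is available): with $G = 2I$, $c = (-2,-5)^T$ and the single constraint $x_1 + x_2 = 1$, assembling and solving the KKT system gives $x^* = (-0.25, 1.25)$ and $\lambda^* = -2.5$:

```python
import numpy as np

# Equality-constrained QP:  min 1/2 x^T G x + c^T x  s.t.  A x = b
G = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Assemble the KKT system [[G, -A^T], [A, 0]] [x; lam] = [-c; b] and solve it
n, m = G.shape[0], A.shape[0]
K = np.block([[G, -A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(K, rhs)
x_star, lam_star = sol[:n], sol[n:]
```

For large sparse problems one would factor the KKT matrix (e.g. with a symmetric indefinite factorization) rather than solve it densely, but the structure is the same.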

## Inequality Constraints

For problem $$\eqref{qp}$$, we can define the active set as $\mathcal{A}(x^*) =\{i\in\mathcal{E}\cup\mathcal{I} \mid a_i^T x^* = b_i \}$ and the KKT conditions are \begin{equation}\begin{aligned}G x^{*}+c-\sum_{i \in \mathcal{A}\left(x^{*}\right)} \lambda_{i}^{*} a_{i} &=0 & & \\a_{i}^{T} x^{*} &=b_{i} & & \forall i \in \mathcal{A}\left(x^{*}\right) \\a_{i}^{T} x^{*} & \geq b_{i} & & \forall i \in \mathcal{I} \backslash \mathcal{A}\left(x^{*}\right) \\\lambda_{i}^{*} & \geq 0 & & \forall i \in \mathcal{I} \cap \mathcal{A}\left(x^{*}\right)\end{aligned}\label{qp-kkt-general}\end{equation} Theorem: If $$x^*$$ satisfies the KKT conditions for some $$\lambda_i^*, i\in\mathcal{A}(x^*)$$, and $$G$$ is P.S.D., then $$x^*$$ is a global solution of $$\eqref{qp}$$

## Active-Set Methods

Assume $$G$$ is P.S.D. Let $$q(x) = \frac{1}{2} x^{T} G x+x^{T} c$$. Define the working set $$\mathcal{W}$$ to contain all the equality constraints and a subset of the inequality constraints.

Given $$x_k$$ and the working set $$\mathcal{W}_k$$, first check whether $$x_k$$ minimizes $$q$$ in the subspace defined by the working set. If not, compute a step $$p$$ by solving the subproblem: \begin{aligned}&\min_p && q(x_k + p) \\&\text{ s.t. } && a_i^Tp = 0, \quad i\in\mathcal{W}_k\end{aligned} which is equivalent to \begin{equation}\begin{aligned}&\min_p && \frac{1}{2}p^TGp + g_k^T p \\&\text{ s.t. } && a_i^Tp = 0, \quad i\in\mathcal{W}_k\end{aligned}\label{qp-sub}\end{equation} where $$g_k = G x_k + c$$

If the optimal $$p_k$$ is nonzero, we compute the largest step length $$\alpha_k \in [0,1]$$ s.t. $$x_{k+1}=x_k + \alpha_k p_k$$ satisfies all constraints: $\begin{equation}\alpha_k = \min(1, \min_{i\notin\mathcal{W}_k,\ a_i^Tp_k < 0} \frac{b_i-a_i^T x_k}{a_i^T p_k})\label{alpha}\end{equation}$ If $$\alpha_k<1$$, that means the step along $$p_k$$ was blocked by some constraint not in $$\mathcal{W}_k$$, so we construct $$\mathcal{W}_{k+1}$$ by adding one of the blocking constraints to $$\mathcal{W}_k$$.

Continue the iterations until the subproblem has the solution $$p=0$$. Then from $$\eqref{qp-kkt}$$ we have $\begin{equation}\sum_{i \in \hat{\mathcal{W}}} a_{i} \hat{\lambda}_{i}=g=G \hat{x}+c\label{multiplier}\end{equation}$ for some $$\hat{\lambda}_i,\ i\in\hat{\mathcal{W}}$$. Define the multipliers not in the working set to be zero. Then the first three KKT conditions of $$\eqref{qp-kkt-general}$$ are all satisfied.

If all the multipliers with indices $$i\in \hat{\mathcal{W}} \cap \mathcal{I}$$ are nonnegative, then $$\hat{x}$$ is the global solution of $$\eqref{qp}$$. If one or more of the multipliers is negative, then we must remove the corresponding indices from the working set and continue the iteration.

Theorem: Suppose the point $$\hat{x}$$ satisfies the KKT conditions for the equality-constrained subproblem with working set $$\hat{\mathcal{W}}$$. Suppose, too, that the $$a_i, \ i\in\hat{\mathcal{W}}$$ are linearly independent and there is an index $$j\in\hat{\mathcal{W}}$$ s.t. $$\hat{\lambda}_j < 0$$. Let $$p$$ be the solution obtained by dropping the constraint $$j$$ from the original problem: \begin{equation}\begin{aligned}&\min_p && \frac{1}{2}p^TGp + g_k^T p \\&\text{ s.t. } && a_i^Tp = 0, \quad i\in\mathcal{W}_k,\ i\ne j\end{aligned}\label{drop}\end{equation} Then $$p$$ is a feasible direction for constraint $$j$$, that is, $$a_j^Tp \ge 0$$. Moreover, if $$p$$ satisfies second-order sufficient conditions for $$\eqref{drop}$$, then $$a_j^T p > 0$$, and $$p$$ is a descent direction for $$q$$.

Theorem: Suppose that the solution $$p_k$$ of $$\eqref{qp-sub}$$ is nonzero and satisfies the second-order sufficient conditions for that problem. Then $$q$$ is strictly decreasing along the direction of $$p_k$$

Algorithm:

1. Compute a feasible starting point $$x_0$$ and set $$\mathcal{W}_0$$ to be a subset of the active constraints at $$x_0$$
2. For $$k=0,1,2,\dots$$
    1. Solve $$\eqref{qp-sub}$$ to find $$p_k$$
    2. If $$p_k = 0$$
        1. Compute Lagrange multipliers $$\hat{\lambda}_i$$ that satisfy $$\eqref{multiplier}$$
        2. Return $$x_k$$ if $$\hat{\lambda}_i \ge 0$$ for all $$i\in \hat{\mathcal{W}} \cap \mathcal{I}$$
        3. Otherwise find $$j=\arg\min_{j\in \hat{\mathcal{W}} \cap \mathcal{I}} \hat{\lambda}_j$$ and update $$\mathcal{W}_{k+1} = \mathcal{W}_k \setminus \{j\}$$, $$x_{k+1} \leftarrow x_k$$
    3. Else
        1. Compute $$\alpha_k$$ from $$\eqref{alpha}$$ and update $$x_{k+1} \leftarrow x_k + \alpha_k p_k$$
        2. Add blocking constraints to $$\mathcal{W}_k$$ to obtain $$\mathcal{W}_{k+1}$$

### Initial Feasible Point

We can determine an initial feasible point $$\tilde{x}$$ by solving the linear program: \begin{aligned}& \min _{(x, z)} && e^{T} z & & \\&\text { s.t. } && a_{i}^{T} x+\gamma_{i} z_{i} =b_{i}, & i \in \mathcal{E} \\& && a_{i}^{T} x+\gamma_{i} z_{i} \geq b_{i}, & i \in \mathcal{I} \\& && z \geq 0, &\end{aligned} where $$e=(1,1,\dots, 1)^T$$, $$\gamma_{i}=-\operatorname{sign}(a_i^T \tilde{x}-b_i)$$ for $$i\in\mathcal{E}$$, and $$\gamma_i =1$$ for $$i\in\mathcal{I}$$. A feasible point for this problem is $x = \tilde{x} \qquad z_i = \begin{cases}|a_i^T \tilde{x} - b_i|, &i\in\mathcal{E} \\\max(b_i - a_i^T\tilde{x}, 0), &i\in\mathcal{I}\end{cases}$ An alternative approach is a penalty (or big M) method which introduces a scalar artificial variable $$\eta$$ into $$\eqref{qp}$$ to measure the constraint violation, and solve the problem: \begin{aligned}&\min_{x} & \frac{1}{2} x^{T} G x+x^{T} c +M\eta \\&\text { s.t. } & a_{i}^{T} x - b_{i} \le \eta, && i \in \mathcal{E} \\& & -(a_{i}^{T} x - b_{i}) \le \eta, && i \in \mathcal{E} \\& & b_i - a_i^T x \le \eta, && i\in \mathcal{I} \\& & 0 \le \eta,\end{aligned} for some large positive value of $$M$$. It can be shown that whenever there exist feasible points for the original problem $$\eqref{qp}$$, then for all $$M$$ sufficiently large, the solution of the penalty method will have $$\eta=0$$, with an $$x$$ component that is the solution for $$\eqref{qp}$$.

# Sequential Quadratic Programming

## Local SQP Method

Consider the equality constrained problem: \begin{aligned}&\min && f(x) \\&\text{ s.t. } && c(x) = 0\end{aligned} where $$f:\mathbb{R}^n \to \mathbb{R}$$ and $$c:\mathbb{R}^n\to\mathbb{R}^m$$ are smooth. The idea behind SQP is to model this problem at the current iterate $$x_k$$ by a QP subproblem, then use the minimizer of this subproblem to define a new iterate $$x_{k+1}$$. The simplest way is to apply Newton's method to the KKT conditions.

Let $$A(x)$$ denote the Jacobian matrix of the constraints: $A(x)^T = [\nabla c_1(x), \nabla c_2(x), \dots, \nabla c_m(x)]$ The KKT conditions are $F(x,\lambda) = \begin{bmatrix}\nabla f(x) - A(x)^T \lambda \\c(x)\end{bmatrix} = 0$ This can be solved by Newton's method: $\begin{bmatrix}x_{k+1} \\ \lambda_{k+1}\end{bmatrix} =\begin{bmatrix}x_k \\ \lambda_k\end{bmatrix} +\begin{bmatrix}p_k \\ p_\lambda\end{bmatrix}$ where $$p_k$$ and $$p_\lambda$$ solve the Newton-KKT system: $\begin{equation}\begin{bmatrix}\nabla_{xx}^2 \mathcal{L}_k & -A_k^T \\A_k & 0\end{bmatrix}\begin{bmatrix}p_k \\ p_\lambda\end{bmatrix} =\begin{bmatrix}-\nabla f_k + A_k^T \lambda_k \\-c_k\end{bmatrix}\label{newton-kkt}\end{equation}$ The Newton iteration is well defined when the KKT matrix is nonsingular.

### SQP Framework

An alternative way to view the iteration is to consider a quadratic problem at the iterate $$(x_k, \lambda_k)$$: \begin{equation}\begin{aligned}&\min_p && \nabla f_k^T p + \frac{1}{2}p^T \nabla_{xx}^2 \mathcal{L}_k p \\&\text{ s.t. } && A_kp + c_k = 0\end{aligned}\label{sqp-sub}\end{equation} If the KKT matrix is nonsingular, this problem has a unique solution $$(p_k, l_k)$$ that satisfies the KKT conditions: $\begin{equation}\begin{bmatrix}\nabla_{xx}^2 \mathcal{L}_k & -A_k^T \\A_k & 0\end{bmatrix}\begin{bmatrix}p_k \\ l_k\end{bmatrix} =\begin{bmatrix}-\nabla f_k \\-c_k\end{bmatrix}\label{sqp-kkt}\end{equation}$ Compared with $$\eqref{newton-kkt}$$ we can see that $$l_k = p_\lambda + \lambda_k = \lambda_{k+1}$$.

Algorithm:

1. Choose an initial pair $$(x_0, \lambda_0)$$, set $$k=0$$
2. Repeat until convergence
1. Evaluate $$\nabla f_k, \nabla_{xx}^2 \mathcal{L}_k, c_k, A_k$$
2. Solve $$\eqref{sqp-sub}$$ for $$p_k$$ and $$l_k$$
3. Update $$x_{k+1} \leftarrow x_k + p_k$$ and $$\lambda_{k+1} \leftarrow l_k$$
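A minimal sketch of this Newton-KKT iteration (assuming NumPy is available) on an equality-constrained toy problem of my own, $\min x_1 + x_2$ s.t. $x_1^2 + x_2^2 = 2$, whose solution is $x^* = (-1,-1)$ with $\lambda^* = -1/2$:

```python
import numpy as np

def sqp_solve(x0, lam0, iters=20):
    """Local SQP (Newton on the KKT conditions) for min x1+x2 s.t. x1^2+x2^2-2=0."""
    x, lam = np.array(x0, dtype=float), float(lam0)
    for _ in range(iters):
        grad_f = np.array([1.0, 1.0])
        c = np.array([x[0]**2 + x[1]**2 - 2.0])
        A = np.array([[2.0 * x[0], 2.0 * x[1]]])          # constraint Jacobian
        H = -2.0 * lam * np.eye(2)                        # Hessian of the Lagrangian f - lam*c
        K = np.block([[H, -A.T], [A, np.zeros((1, 1))]])  # Newton-KKT matrix
        rhs = np.concatenate([-grad_f + lam * A[0], -c])
        step = np.linalg.solve(K, rhs)                    # (p_k, p_lambda)
        x += step[:2]
        lam += step[2]
    return x, lam

x_star, lam_star = sqp_solve([-1.2, -0.8], -0.4)
```

Started close enough to the solution, the iterates converge quadratically, which is the main appeal of the local SQP method.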

### Inequality Constraints

The SQP framework can be extended to general nonlinear programming problem $$\eqref{con-optim}$$ as \begin{equation}\begin{aligned}&\min_p && \nabla f_k^T p + \frac{1}{2}p^T \nabla_{xx}^2 \mathcal{L}_k p \\&\text{ s.t. } && \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i\in\mathcal{E} \\& && \nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i\in\mathcal{I}\end{aligned}\label{qp-sub-ineq}\end{equation} Theorem: Suppose $$(x^*,\lambda^*)$$ is a local solution of $$\eqref{con-optim}$$, and LICQ, strict complementarity condition, second-order sufficient conditions hold. Then if $$(x_k, \lambda_k)$$ is sufficiently close to $$(x^*, \lambda^*)$$, there is a local solution of the subproblem $$\eqref{qp-sub-ineq}$$ whose active set $$\mathcal{A}_k$$ is the same as the active set $$\mathcal{A}(x^*)$$ of $$\eqref{con-optim}$$.

## Algorithmic Development

### Handling Inconsistent Linearizations

The linearizations of the nonlinear constraints $$\eqref{qp-sub-ineq}$$ may give an infeasible subproblem. To overcome this difficulty, we can reformulate $$\eqref{con-optim}$$ as the $$\ell_1$$ penalty problem: \begin{equation}\begin{aligned}&\min_{x, v, w, t} && f(x)+\mu \sum_{i \in \mathcal{E}}\left(v_{i}+w_{i}\right)+\mu \sum_{i \in \mathcal{I}} t_{i} \\&\text { s.t. } && c_{i}(x)=v_{i}-w_{i}, \quad i \in \mathcal{E} \\& &&c_{i}(x) \geq-t_{i}, \quad i \in \mathcal{I} \\& &&v,\ w,\ t \geq 0\end{aligned}\label{l1-penalty}\end{equation} for some positive choice of the penalty parameter $$\mu$$. The quadratic subproblem associated with it is always feasible. When $$\mu$$ is sufficiently large, then the solution $$x^*$$ coincides with the original problem.

### Merit Functions

A merit function can be used to decide whether a trial step should be accepted. In line search methods, it controls the size of the step; in trust-region methods it determines whether the step is accepted or rejected and whether the radius should be adjusted.

Consider the $$\ell_1$$ merit function and only equality constraints: $\phi_{1}(x ; \mu)=f(x)+\mu\|c(x)\|_1$ In a line search method, a step $$\alpha_k p_k$$ will be accepted if the following condition holds: $\begin{equation}\phi_{1}(x_{k}+\alpha_{k} p_{k} ; \mu_{k}) \leq \phi_{1}(x_{k}; \mu_{k})+\eta \alpha_{k} D(\phi_1(x_{k} ; \mu) ; p_{k}), \quad \eta \in(0,1)\label{direction}\end{equation}$ where $$D(\phi_1(x_{k} ; \mu) ; p_{k})$$ is the directional derivative of $$\phi_1$$ in the direction $$p_k$$.

Theorem: Let $$p_k$$ and $$\lambda_{k+1}$$ be generated by the SQP iteration $$\eqref{sqp-kkt}$$. Then we have

• $$D(\phi_1(x_{k} ; \mu) ; p_{k}) = \nabla f_k^T p_k -\mu \|c_k\|_1$$
• $$D(\phi_1(x_{k} ; \mu) ; p_{k}) \le -p_k^T\nabla_{xx}^2 \mathcal{L}_k p_k - (\mu - \|\lambda_{k+1}\|_\infty) \|c_k\|_1$$

## Line Search SQP Method

Algorithm:

1. Choose $$\eta\in(0,0.5)$$, $$\tau\in(0, 1)$$ and an initial pair $$(x_0, \lambda_0)$$
2. Evaluate $$f_0, \nabla f_0, c_0, A_0, \nabla_{xx}^2\mathcal{L}_0$$
3. Repeat until convergence
    1. Solve $$\eqref{qp-sub-ineq}$$ for $$p_k$$ and $$\hat{\lambda}$$
    2. Set $$p_\lambda \leftarrow \hat{\lambda} - \lambda_k$$
    3. Choose $$\mu_k$$ large enough so that $$p_k$$ is a descent direction for $$\phi_1$$
    4. Set $$\alpha_k \leftarrow 1$$ and repeat $$\alpha_k \leftarrow \tau_a \alpha_k$$ for some $$\tau_a\in(0,\tau]$$ until $$\eqref{direction}$$ holds
    5. Set $$x_{k+1}\leftarrow x_k + \alpha_k p_k$$ and $$\lambda_{k+1}\leftarrow \lambda_k + \alpha_k p_\lambda$$
    6. Evaluate $$f_{k+1}, \nabla f_{k+1}, c_{k+1}, A_{k+1}, \nabla_{xx}^2\mathcal{L}_{k+1}$$

## Trust-Region SQP Methods

Trust-region methods can control the quality of the steps even when the Hessian $$\nabla_{xx}^2 \mathcal{L}_k$$ is not positive definite, and they provide a mechanism for enforcing global convergence.

Add a trust-region constraint to the subproblem $$\eqref{qp-sub-ineq}$$ to get \begin{aligned}&\min_p && \nabla f_k^T p + \frac{1}{2}p^T \nabla_{xx}^2 \mathcal{L}_k p \\&\text{ s.t. } && \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i\in\mathcal{E} \\& && \nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i\in\mathcal{I} \\& && \|p\| \le \Delta_k\end{aligned} After adding the constraint, the problem may not have a solution. However, the idea is not to satisfy the other constraints at every step, but to improve the feasibility of these constraints.

### Relaxation Method

Consider the equality constraints only. At iteration $$x_k$$, we solve the subproblem: \begin{aligned}&\min_p && \nabla f_k^T p + \frac{1}{2}p^T \nabla_{xx}^2 \mathcal{L}_k p \\&\text{ s.t. } && A_kp + c_k = r_k \\& && \|p\|_2 \le \Delta_k\end{aligned} The choice of the relaxation vector $$r_k$$ impacts the efficiency of the method. To do so, first solve the auxiliary problem: \begin{aligned}&\min_v && \|A_k v + c_k\|_2^2 \\&\text{ s.t. } && \|v\|_2 \le 0.8\Delta_k\end{aligned} and set $$r_k = A_k v + c_k$$. Now we can compute $$p_k$$ and update $$x_{k+1} = x_k+p_k$$ and $$\lambda_{k+1} = (A_kA_k^T)^{-1}A_k \nabla f_k$$

]]>
<p>EN553.762 Review. Based on <a href="https://www.csie.ntu.edu.tw/~r97002/temp/num_optimization.pdf">Numerical Optimization Book</a></p>
Python Tricks https://silencial.github.io/python-tricks/ 2021-04-26T00:00:00.000Z 2021-05-03T00:00:00.000Z A collection of Python notes and tips

# Syntax

## Decorators

### Built-in Decorators

1. @property: lets a member method be accessed like a member variable
2. @staticmethod: can be called directly on the class name; signals that the function does not access the class or instance
3. @classmethod: can be called directly on the class name and can only access class variables; the first parameter is the class itself. Commonly used to provide alternative constructors
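A minimal sketch of all three decorators in one class (the Temperature example is made up):

```python
class Temperature:
    """Demonstrates the three built-in decorators."""
    scale = "Celsius"                  # class variable

    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def celsius(self):                 # accessed like an attribute, no ()
        return self._celsius

    @staticmethod
    def to_fahrenheit(c):              # needs no access to the class or instance
        return c * 9 / 5 + 32

    @classmethod
    def boiling(cls):                  # alternative constructor; cls is the class itself
        return cls(100)

t = Temperature(25)
```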

## Multiple Inheritance

super works as follows:
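A minimal diamond-inheritance sketch (the classes are made up): `super()` follows the class's method resolution order (MRO) rather than simply calling the direct parent, so every class in the diamond runs exactly once:

```python
class A:
    def hello(self):
        return ["A"]

class B(A):
    def hello(self):
        return ["B"] + super().hello()

class C(A):
    def hello(self):
        return ["C"] + super().hello()

class D(B, C):
    def hello(self):
        return ["D"] + super().hello()

# MRO of D is D -> B -> C -> A -> object, so super() inside B jumps to C, not A
order = D().hello()
```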

## Context Managers

Python's contextmanager decorator provides a simpler way to write one:
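A minimal sketch (the "resource" is simulated with a list of events): the code before `yield` runs on entry, and the `finally` block runs on exit even if the body raises:

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_resource(name):
    events.append(f"acquire {name}")      # runs on entering the with-block
    try:
        yield name                        # value bound by `as`
    finally:
        events.append(f"release {name}")  # runs even if the body raises

with managed_resource("db") as r:
    events.append(f"use {r}")
```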

## Magic Methods

### Representation

• To put a custom class into a set or use it as a dict key, you must override both the __hash__ and __eq__ magic methods
• __str__ is called by the print function; __repr__ is called when you print a list/set/dict containing the object
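A minimal sketch of these four methods (the Point class is made up):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):              # must pair with __eq__ for set/dict-key use
        return hash((self.x, self.y))

    def __str__(self):               # used by print(p)
        return f"({self.x}, {self.y})"

    def __repr__(self):              # used inside containers, e.g. print([p])
        return f"Point({self.x}, {self.y})"

points = {Point(0, 0), Point(0, 0), Point(1, 2)}   # equal points collapse to one
```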

# Concurrent Programming

• Concurrency: within a time interval, multiple programs make progress on the same CPU, but at any instant only one of them is running
• Parallelism: at any instant, multiple programs are running on multiple CPUs

## Multithreading

• Because of Python's global interpreter lock (GIL), multithreading cannot take advantage of multiple cores, so it is unsuitable for CPU-bound tasks.
• Threads can communicate through global variables.

Use multithreading when:

• The program maintains a lot of shared state (list/dict/set)
• The program spends most of its time on I/O, such as web scraping
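A minimal sketch of an I/O-bound workload where threads help (the download is simulated with time.sleep, and the URLs and function names are made up); because sleeping, like real network I/O, releases the GIL, the waits overlap instead of adding up:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_download(url):
    time.sleep(0.01)       # stand-in for network latency; I/O waits release the GIL
    return f"content of {url}"

urls = [f"https://example.com/page/{i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_download, urls))   # order of results matches urls
```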

## Multiprocessing

Use multiprocessing for:

• CPU-bound tasks
• Programs whose input can be split into chunks processed in parallel, with results that can be merged

## Asynchronous I/O

• async def: defines a coroutine function, whose body may use await statements
• await: suspends the current coroutine and lets other coroutines run until the awaited statement completes
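A minimal asyncio sketch of the two keywords (the coroutine is illustrative): while one coroutine awaits, the others make progress:

```python
import asyncio

async def fetch(i):
    await asyncio.sleep(0.01)   # suspends this coroutine; others run in the meantime
    return i * i

async def main():
    # gather schedules the coroutines concurrently and returns results in order
    return await asyncio.gather(*(fetch(i) for i in range(5)))

results = asyncio.run(main())
```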

# Cryptography

• AES: symmetric encryption algorithm
• RSA: asymmetric encryption algorithm
• md5: cryptographic hash function; converts a string into a 32-character hex string
• base64: encodes binary data using 64 printable characters; commonly used to pass small amounts of binary data in web pages
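AES and RSA need third-party packages, but md5 and base64 are in the standard library; a minimal sketch:

```python
import hashlib
import base64

# MD5: the 128-bit digest is conventionally shown as a 32-character hex string
digest = hashlib.md5(b"abc").hexdigest()

# base64: encode binary data as printable ASCII, and decode it back
encoded = base64.b64encode(b"hi")
decoded = base64.b64decode(encoded)
```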

# Testing

## Coverage Testing

• Run coverage run -m pytest [args] from the command line; this generates a .coverage file in the current directory

• coverage report -m: print the generated report
• coverage html: render the report as an interactive web page

# Python Packages

## Packaging

• packages: the folders containing the code to be packaged
• install_requires: the project's dependencies
• console_scripts: registers a function in the project as an executable command

### Version Numbering

• Bump the major version: changes that are incompatible with previous versions
• Bump the minor version: new features that do not break previous versions
• Bump the patch version: bug fixes

## Installation

pip install -e .: looks for setup.py in the current directory and installs the package in editable mode, i.e. code changes take effect without reinstalling the package.

# Pitfalls

## Default Arguments
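The classic pitfall is a mutable default value, which is evaluated once at function definition time and shared across calls; a minimal illustration (the function names are mine):

```python
def append_bad(item, bucket=[]):      # the default list is created ONCE, at def time
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):   # idiomatic fix: sentinel + fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_bad(1)
second = append_bad(2)                # same list object as `first`
```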

]]>
<p>A collection of Python notes and tips</p>
Python Web Scraping https://silencial.github.io/scraper/ 2021-03-20T00:00:00.000Z 2021-04-03T00:00:00.000Z Python web scraping 101

# Web Page Analysis

## HTTP Requests

• General:
• Request URL: the address being requested
• Request Method: the request method; the most common are GET and POST, where GET retrieves information and POST submits information
• Status Code: the return code; a successful request is usually 200, and different codes have different meanings
• Request Headers: the headers sent to the server with the request
• User-Agent: identifies the requesting platform, browser, etc.
• Cookie: tells the server whether two requests come from the same browser, e.g. to keep a user logged in
• Response Headers: the headers of the response
• Content-Type: the type of the returned data

## Example

1. Open the page and the browser developer tools, go to the Network panel, refresh the page, and click the topmost URL

2. In the Headers tab, note that the returned file type is text/html

3. The Preview tab shows a new "page", ==which may differ from the original page!== because some page content is not determined by the HTML; in that case the other requests need to be analyzed

4. In the Response tab, i.e. the raw HTML, search for the content you need and analyze its structure, as below

5. There are 10 pages in total, with 25 entries per page. Clicking the second page changes the URL to https://movie.douban.com/top250?start=25&filter=. We can guess that the ?start=25&filter= parameters implement pagination, and confirm it by changing the start value in the browser

6. Design the program: loop over the request URLs, using the start parameter to turn pages; after getting each response, extract the desired content using the structure found above
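The pagination part of step 6 can be sketched as pure URL construction (stdlib only; the helper name is mine):

```python
# Douban Top 250: 10 pages, 25 entries per page, paged via the `start` parameter
def page_urls(base="https://movie.douban.com/top250", pages=10, per_page=25):
    return [f"{base}?start={i * per_page}&filter=" for i in range(pages)]

urls = page_urls()   # feed each URL to the request loop, then parse the response
```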

# Requests

Requests is a Python library for making HTTP requests. Some common usage:

## BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML/XML files. Some common usage:

# Selenium

Selenium is a browser automation testing tool that simulates a human operating a web page, and it provides a Python API. Some common usage:

# Ajax

Ajax is a technique where JS updates page content asynchronously, i.e. the content changes without the page being reloaded.

# Disguising Requests

## Request Intervals

]]>
<p>Python web scraping 101</p>

| Right block mass | Collisions |
|:---:|:---:|
| 1 | 3 |
| 100 | 31 |
| 10000 | 314 |
| 1000000 | 3141 |

# Solution 1    # Solution 2   ]]>
<p><strong>Problem:</strong></p> <p>Imagine two blocks at rest on a frictionless floor, with a wall of infinite mass on the left. The left block has mass <span class="math inline">$$1$$</span>. Now give the right block a velocity toward the left, and assume all collisions are perfectly elastic. How many collisions happen in total? (collisions between the blocks + collisions between the left block and the wall)</p> <p><img src="https://i.imgur.com/43fJNPK.png" alt="Collision" /></p> <p>When the mass of the right block is <span class="math inline">$$100^k$$</span>, the total number of collisions is <span class="math inline">$$\lfloor 10^k \pi \rfloor$$</span></p> <table> <thead> <tr class="header"> <th style="text-align: center;">Right block mass</th> <th style="text-align: center;">Collisions</th> </tr> </thead> <tbody> <tr class="odd"> <td style="text-align: center;">1</td> <td style="text-align: center;">3</td> </tr> <tr class="even"> <td style="text-align: center;">100</td> <td style="text-align: center;">31</td> </tr> <tr class="odd"> <td style="text-align: center;">10000</td> <td style="text-align: center;">314</td> </tr> <tr class="even"> <td style="text-align: center;">1000000</td> <td style="text-align: center;">3141</td> </tr> </tbody> </table>
Robotic Manipulation https://silencial.github.io/robotic-manipulation/ 2020-12-30T00:00:00.000Z 2020-12-30T00:00:00.000Z EN530.646 Review. Based on A Mathematical Introduction to Robotic Manipulation book.

# Math Preliminaries

## Vector Space

A vector space over a field $$F$$ is a set $$V$$ together with two operations $$(+, \cdot)$$ defined as

1. Vector addition $$+$$: $$V\times V \rightarrow V$$
2. Scalar multiplication $$\cdot$$: $$F\times V \rightarrow V$$

It has to satisfy eight axioms, $$\forall x, y, z \in V$$ and $$\forall \alpha, \beta \in F$$

1. $$x+y=y+x$$
2. $$(x+y)+z = x+(y+z)$$
3. $$\exists 0\in V$$, s.t. $$x+0=x$$
4. $$\exists -x \in V$$, s.t. $$x + (-x) = 0$$
5. $$\alpha \cdot(x+y) = \alpha\cdot x + \alpha\cdot y$$
6. $$(\alpha+\beta)\cdot x = \alpha\cdot x + \beta\cdot x$$
7. $$(\alpha\beta)\cdot x = \alpha\cdot(\beta\cdot x)$$
8. $$1\cdot x = x$$, where $$1$$ is the multiplicative identity in $$F$$

# Rigid Body Motion

## Rotational Motion

Let $$A$$ be the inertial frame, $$B$$ the body frame, and $$\mathbf{x}_{ab}, \mathbf{y}_{ab},\mathbf{z}_{ab} \in \mathbb{R}^3$$ the coordinates of the principal axes of $$B$$ relative to $$A$$. Define the $$3\times 3$$ rotation matrix $R_{ab} = \begin{bmatrix}\mathbf{x}_{ab} & \mathbf{y}_{ab} & \mathbf{z}_{ab}\end{bmatrix}$ Properties:

1. $$RR^T=R^TR=I$$
2. $$\det R = \pm 1$$. If the coordinate frame is right-handed, then $$\det R = 1$$

Let $$q_a, q_b$$ be the coordinates of a point $$q$$ relative to frames $$A$$ and $$B$$, then we have $q_a = R_{ab}q_b$

### Special Orthogonal Group

Define $$SO(n)$$ as $SO(n) = \{R\in\mathbb{R}^{n\times n} : RR^T=I, \det R = 1 \}$ $$SO(n)$$ is a group under the operation of matrix multiplication.

A set $$G$$ with a binary operation $$\circ$$ defined on elements of $$G$$ is called a group if it satisfies the following axioms:

1. Closure: If $$g_1, g_2\in G$$, then $$g_1 \circ g_2 \in G$$
2. Identity: There exists an identity element $$e$$, s.t. $$g\circ e=e\circ g = g$$ for every $$g\in G$$
3. Inverse: For each $$g\in G$$, there exists only one inverse $$g^{-1}\in G$$, s.t. $$g\circ g^{-1} = g^{-1} \circ g = e$$
4. Associativity: If $$g_1, g_2, g_3 \in G$$, then $$(g_1\circ g_2)\circ g_3 = g_1\circ(g_2\circ g_3)$$

Define $$so(n)$$ as $so(n)=\{S\in\mathbb{R}^{n\times n}: S^T=-S \}$

### Exponential Coordinates

Let $$\omega \in \mathbb{R}^3$$ be a unit vector which specifies the direction of the rotation and $$\theta \in \mathbb{R}$$ be the angle of rotation. The rotation matrix can be represented as $R(\omega, \theta) = e^{\widehat{\omega} \theta}$ where $$\widehat{\omega} \in so(3)$$ with the property $$\omega \times b = \widehat{\omega}b$$ $\widehat{\omega} = \begin{bmatrix}0 & -\omega_3 & \omega_2 \\\omega_3 & 0 & -\omega_1 \\-\omega_2 & \omega_1 & 0\end{bmatrix}$ When $$\|\omega\| = 1$$ $e^{\widehat{\omega}\theta} = I + \widehat{\omega}\sin\theta + \widehat{\omega}^2 (1-\cos\theta)$ The exponential map from $$so(3)$$ to $$SO(3)$$ is surjective
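A small numerical check of Rodrigues' formula (assuming NumPy is available; the helper names are mine): a quarter turn about the z-axis should map the x-axis to the y-axis and produce a proper rotation matrix:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ b == np.cross(w, b)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rot(w, theta):
    """Rodrigues' formula: exp(hat(w) * theta) for a unit axis w."""
    W = hat(np.asarray(w, dtype=float))
    return np.eye(3) + W * np.sin(theta) + W @ W * (1 - np.cos(theta))

R = rot([0.0, 0.0, 1.0], np.pi / 2)   # quarter turn about the z-axis
```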

Important properties: $R\widehat{\omega}R^T = (R\omega)^\wedge$

### Euler Angles

$R_x(\alpha):=e^{\widehat{x}\alpha} = \begin{bmatrix}1 & 0 & 0 \\0 & \cos\alpha & -\sin\alpha \\0 & \sin\alpha & \cos\alpha \\\end{bmatrix} \\R_y(\beta):=e^{\widehat{y}\beta} = \begin{bmatrix}\cos\beta & 0 & \sin\beta \\0 & 1 & 0 \\-\sin\beta & 0 & \cos\beta\end{bmatrix} \\R_z(\gamma):=e^{\widehat{z}\gamma} = \begin{bmatrix}\cos\gamma & -\sin\gamma & 0 \\\sin\gamma & \cos\gamma & 0\\0 & 0 & 1\end{bmatrix}$

$$ZYZ$$ is a commonly used Euler angles: Start with frame $$B$$ coincident with frame $$A$$. First rotate $$B$$ about the $$z$$-axis of frame $$B$$ by an angle $$\alpha$$, then rotate about the (new) $$y$$-axis of frame $$B$$ by an angle $$\beta$$, and then rotate about the (new) $$z$$-axis of frame $$B$$ by an angle of $$\gamma$$. The rotation of $$B$$ relative to A is $R_{ab} = R_z(\alpha)R_y(\beta)R_z(\gamma)$

### Quaternions

\begin{aligned}Q &= q_0+q_1 \mathbf{i} + q_2\mathbf{j} + q_3\mathbf{k} \qquad q_i\in\mathbb{R} \\&=(q_0, \vec{q})\end{aligned}

where $\mathbf{i}\cdot\mathbf{i}=\mathbf{j}\cdot\mathbf{j}=\mathbf{k}\cdot\mathbf{k}=-1 \\\mathbf{i}\cdot\mathbf{j}=-\mathbf{j}\cdot\mathbf{i}=\mathbf{k} \qquad \mathbf{j}\cdot\mathbf{k}=-\mathbf{k}\cdot\mathbf{j}=\mathbf{i} \qquad \mathbf{k}\cdot\mathbf{i}=-\mathbf{i}\cdot\mathbf{k}=\mathbf{j}$ Definitions:

1. The conjugate of a quaternion: $$Q^*=(q_0, -\vec{q})$$
2. $$\|Q\|^2=Q\cdot Q^*=q_0^2 + q_1^2 + q_2^2 + q_3^2$$
3. $$Q\cdot P=(q_0 p_0 - \vec{q}\cdot\vec{p}, q_0 \vec{p} + p_0\vec{q}+\vec{q}\times\vec{p})$$

Given a rotation matrix $$R=e^{\widehat{\omega}\theta}$$, define the unit quaternion as $Q=(\cos(\theta / 2), \omega\sin(\theta / 2))$ To rotate a point $$x$$, first let $$X=(0, \vec{x})$$ be a pure quaternion, then $QXQ^*$ is also a pure quaternion and the vector part is the rotated $$x$$
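As an illustrative check of the rotation formula above (pure Python; the function names are mine), rotating $x = (1,0,0)$ about the z-axis by $90°$ via $QXQ^*$ should give $(0,1,0)$:

```python
import math

def qmul(q, p):
    """Quaternion product Q.P = (q0 p0 - q.p, q0 p + p0 q + q x p)."""
    q0, q1, q2, q3 = q
    p0, p1, p2, p3 = p
    return (q0*p0 - q1*p1 - q2*p2 - q3*p3,
            q0*p1 + p0*q1 + q2*p3 - q3*p2,
            q0*p2 + p0*q2 + q3*p1 - q1*p3,
            q0*p3 + p0*q3 + q1*p2 - q2*p1)

def rotate(x, axis, theta):
    """Rotate point x about a unit axis by theta using the unit quaternion Q."""
    w, s = math.cos(theta / 2), math.sin(theta / 2)
    Q = (w, s * axis[0], s * axis[1], s * axis[2])
    Qc = (Q[0], -Q[1], -Q[2], -Q[3])       # conjugate Q*
    X = (0.0, x[0], x[1], x[2])            # pure quaternion for the point
    return qmul(qmul(Q, X), Qc)[1:]        # vector part of Q X Q*

y = rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
```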

## Rigid Motion

A rigid motion consists of a rotation $$R_{ab}$$ and a translation $$p_{ab}$$; it is an affine transformation $q_a = p_{ab} + R_{ab}q_b$ By using homogeneous coordinates, it can be represented in linear form $\bar{q}_a=\begin{bmatrix}q_a \\ 1\end{bmatrix}=\begin{bmatrix}R_{ab} & p_{ab} \\0 & 1\end{bmatrix}\begin{bmatrix}q_b \\ 1\end{bmatrix}=: \bar{g}_{ab}\bar{q}_b$

### Special Euclidean Group

Define $$SE(3)$$ as $SE(3)=\{(p, R): p\in\mathbb{R}^3, R\in SO(3)\}$ Define $$se(3)$$ as $se(3):=\{(v, \widehat{\omega}): v\in\mathbb{R}^3, \widehat{\omega} \in so(3) \}$

### Exponential Coordinates

Similar to $$SO(3)$$, the exponential mapping can be generalized to $$SE(3)$$ $\bar{g}=e^{\widehat{\xi} \theta} \\\widehat{\xi} = \begin{bmatrix}\widehat{\omega} & -\omega\times q \\0 & 0\end{bmatrix} \in se(3)$ where $$\omega$$ is the axis of rotation and $$q$$ is a point on the axis.

$$\xi:=(v, \omega)$$ are the twist coordinates of $$\widehat{\xi}$$ $\begin{bmatrix}v \\ \omega\end{bmatrix}^\wedge = \begin{bmatrix}\widehat{\omega} & v \\0 & 0\end{bmatrix}$ When $$\|\omega\| = 1$$ $\begin{equation}e^{\widehat{\xi} \theta} = \begin{bmatrix}e^{\widehat{\omega} \theta} & (I - e^{\widehat{\omega}\theta}) (\omega \times v) + \omega\omega^Tv\theta \\0 & 1\end{bmatrix}\label{gexp}\end{equation}$ The exponential map from $$se(3)$$ to $$SE(3)$$ is surjective

Note that this represents the ==relative== motion of a rigid body: $p(\theta) = e^{\widehat{\xi}\theta} p(0) \\g_{ab}(\theta) = e^{\widehat{\xi}\theta}g_{ab}(0)$

## Screws

Screw motion is defined as rotation about an axis by an angle $$\theta = M$$, followed by translation along the same axis by $$d=h\theta$$.

• Pitch: $$h:=d/\theta$$. If $$h=\infty$$ then the screw represents a pure translation of magnitude $$M$$
• Axis: $$l=\{q+\lambda \omega : \lambda \in \mathbb{R}\}$$, where $$q$$ is a point on the axis and $$\omega$$ is the direction
• Magnitude: $$M$$

Represent screw motion in homogeneous coordinates: $g = \begin{bmatrix}e^{\widehat{\omega} \theta} & \left(I-e^{\widehat{\omega} \theta}\right) q+h \theta \omega \\0 & 1\end{bmatrix}$ Compared with $$\eqref{gexp}$$, if we choose $$v=-\omega\times q + h\omega$$, then $$\xi=(v,\omega)$$ generates the same screw motion.

### Twist to Screw

Given twist $$\xi = (v, \omega)$$, the screw coordinates are:

• Pitch: $$h=\dfrac{\omega^{T} v}{\|\omega\|^{2}}$$
• Axis: $$l=\left\{\begin{array}{ll} \frac{\omega \times v}{\|\omega\|^{2}}+\lambda \omega: \lambda \in \mathbb{R}, & \text { if } \omega \neq 0 \\ 0+\lambda v: \lambda \in \mathbb{R}, & \text { if } \omega=0 \end{array}\right.$$
• Magnitude: $$M=\left\{\begin{array}{ll} \|\omega\|, & \text { if } \omega \neq 0 \\ \|v\|, & \text { if } \omega=0 \end{array}\right.$$
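These conversion formulas can be checked numerically; below is a small sketch (helper names are made up) that recovers the screw coordinates of a unit rotation about the $$z$$-axis through the point $$(1,0,0)$$:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def twist_to_screw(v, w):
    # returns (pitch h, a point on the axis, magnitude M);
    # the axis point is None for a pure translation (w = 0)
    n2 = dot(w, w)
    if n2 == 0:
        return float('inf'), None, dot(v, v) ** 0.5
    h = dot(w, v) / n2                        # pitch
    q = tuple(c / n2 for c in cross(w, v))    # point on the axis
    M = n2 ** 0.5                             # magnitude
    return h, q, M

# Pure rotation (h = 0) about the z-axis through q = (1, 0, 0): v = -w x q
h, q, M = twist_to_screw(v=(0.0, -1.0, 0.0), w=(0.0, 0.0, 1.0))
```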

### Screw to Twist

If $$h=\infty$$: let $$l=\{q + \lambda v: \|v\| = 1\}$$ and $$\theta = M$$; the twist is $$\widehat{\xi} = \begin{bmatrix}0 & v \\ 0 & 0\end{bmatrix}$$

If $$h\ne\infty$$: let $$l=\{q + \lambda \omega: \|\omega\| = 1\}$$ and $$\theta=M$$; the twist is $$\widehat{\xi} = \begin{bmatrix}\widehat{\omega} & -\omega\times q + h\omega \\ 0 & 0\end{bmatrix}$$

## Velocity of a Rigid Body

### Rotational Velocity

From $$q_a(t)=R_{ab}(t) q_b$$, we can get the velocity of the point in spatial coordinates: $v_{q_{a}}(t)=\frac{d}{d t} q_{a}(t)=\dot{R}_{a b}(t) q_{b} = \dot{R}_{a b}(t) R_{a b}^{-1}(t) R_{a b}(t) q_{b}$ Define spatial angular velocity $$\widehat{\omega}_{ab}^s$$ and body angular velocity $$\widehat{\omega}_{ab}^b$$ as $\widehat{\omega}_{ab}^s := \dot{R}_{ab} R_{ab}^{-1}, \qquad \widehat{\omega}_{ab}^b := R_{ab}^{-1} \dot{R}_{ab}$ And the velocity becomes $v_{q_{a}}(t)=\widehat{\omega}_{a b}^{s} R_{a b}(t) q_{b}=\omega_{a b}^{s}(t) \times q_{a}(t) \\v_{q_b}(t) = R_{ab}^T(t) v_{q_a}(t) = \omega_{ab}^b(t) \times q_b$

### Rigid Body Velocity

Similar to rotational velocity, from $$q_a(t) = g_{ab}(t) q_b$$ we have $v_{q_a} = \frac{d}{dt}q_a(t) = \dot{g}_{ab} q_b = \dot{g}_{ab} g_{ab}^{-1} q_a$ Define the spatial velocity $$\widehat{V}_{ab}^s \in se(3)$$ and body velocity $$\widehat{V}_{ab}^b \in se(3)$$ $\widehat{V}_{a b}^{s}=\dot{g}_{a b} g_{a b}^{-1}, \qquad V_{a b}^{s} =\begin{bmatrix}v_{a b}^{s} \\\omega_{a b}^{s}\end{bmatrix}=\begin{bmatrix}-\dot{R}_{a b} R_{a b}^{T} p_{a b}+\dot{p}_{a b} \\\left(\dot{R}_{a b} R_{a b}^{T}\right)^{\vee}\end{bmatrix} \\\widehat{V}_{a b}^{b}=g_{a b}^{-1}\dot{g}_{a b}, \qquad V_{a b}^{b} =\begin{bmatrix}v_{a b}^{b} \\\omega_{a b}^{b}\end{bmatrix}=\begin{bmatrix}R_{a b}^{T} \dot{p}_{a b} \\\left(R_{a b}^{T} \dot{R}_{a b}\right)^{\vee}\end{bmatrix} \\$

And the velocity becomes $v_{q_{a}}=\widehat{V}_{a b}^{s} q_{a}=\omega_{a b}^{s} \times q_{a}+v_{a b}^{s} \\v_{q_b} = g_{ab}^{-1} v_{q_a} = \widehat{V}_{ab}^b q_b = \omega_{a b}^{b} \times q_{b}+v_{a b}^{b}$

• $$\omega_{ab}^s$$: angular velocity of the body viewed in the spatial frame
• $$v_{ab}^s$$: velocity of a point on the body which is traveling through the origin of the spatial frame
• $$\omega_{ab}^b$$: angular velocity of the coordinate frame viewed in the current body frame
• $$v_{ab}^b$$: velocity of the origin of the body frame viewed in the current body frame

The spatial and body velocity are related by $\omega_{a b}^{s}=R_{a b} \omega_{a b}^{b} \\v_{a b}^{s}=-\omega_{a b}^{s} \times p_{a b}+\dot{p}_{a b}=p_{a b} \times\left(R_{a b} \omega_{a b}^{b}\right)+R_{a b} v_{a b}^{b}$ Define the adjoint transformation associated with $$g$$ as $\mathrm{Ad}_g = \begin{bmatrix}R & \widehat{p} R \\0 & R\end{bmatrix}$ Then we have $$V_{ab}^s = \mathrm{Ad}_g V_{ab}^b$$

### Velocity of a Screw

\begin{aligned}\widehat{V}_{ab}^s &= \dot{g}_{ab}(\theta) g_{ab}^{-1}(\theta) \\&= \frac{d}{dt}\left(e^{\widehat{\xi}\theta} g_{ab}(0)\right) \left(g_{ab}^{-1}(0) e^{-\widehat{\xi}\theta}\right) \\&= \widehat{\xi}\dot{\theta}\end{aligned}

\begin{aligned}\widehat{V}_{a b}^{b} &=g_{a b}^{-1}(\theta) \dot{g}_{a b}(\theta) \\&=\left(g_{a b}^{-1}(0) e^{-\widehat{\xi} \theta}\right)\left(e^{\widehat{\xi} \theta} \widehat{\xi} \dot{\theta} g_{a b}(0)\right) \\&=\left(g_{a b}^{-1}(0) \hat{\xi} g_{a b}(0)\right) \dot{\theta} \\&=\left(\operatorname{Ad}_{g_{a b}^{-1}(0)} \xi\right)^{\wedge} \dot{\theta}\end{aligned}

==Note== that $$\frac{d}{dt}e^{A(t)} = \dot{A}(t) e^{A(t)} = e^{A(t)}\dot{A}(t)$$ iff $$A$$ and $$\dot{A}$$ commute

### Coordinate Transformation

Transformation of spatial velocity: $V_{a c}^{s}=V_{a b}^{s}+\mathrm{Ad}_{g_{a b}} V_{b c}^{s}$ Transformation of body velocity: $V_{a c}^{b}=\mathrm{Ad}_{g_{b c}^{-1}} V_{a b}^{b}+V_{b c}^{b}$ Important properties: $V_{a b}^{b}=-V_{b a}^{s} \\V_{a b}^{b}=-\mathrm{Ad}_{g_{b a}} V_{b a}^{b}$

## Wrenches

Define wrench as a force/moment pair $F = \begin{bmatrix}f \\ \tau\end{bmatrix} \in \mathbb{R}^6$ Let $$B$$ be a coordinate frame attached to a rigid body, then write $$F_b = (f_b, \tau_b)$$ for a wrench applied at the origin of $$B$$.

The infinitesimal work can be represented as $\delta W = V_{ab}^b \cdot F_b$ Now consider another frame $$C$$ that is stationary w.r.t. $$B$$, the work should be the same: $V_{ac}^b \cdot F_c = V_{ab}^b \cdot F_b = (\mathrm{Ad}_{g_{bc}} V_{ac}^b)^T F_b = V_{ac}^b \cdot \mathrm{Ad}_{g_{bc}}^T F_b$ so that $F_c = \mathrm{Ad}_{g_{bc}}^T F_b \\\begin{bmatrix}f_c \\ \tau_c\end{bmatrix} = \begin{bmatrix}R_{bc}^T & 0 \\ -R_{bc}^T \widehat{p}_{bc} & R_{bc}^T\end{bmatrix} \begin{bmatrix}f_b \\ \tau_b\end{bmatrix}$

# Manipulator Kinematics

## Forward Kinematics

Consider open-chain manipulators with base frame $$S$$ and tool frame $$T$$ connected by a series of revolute or prismatic joints. The joint (configuration) space $$Q$$ of a manipulator consists of all possible values of the joint variables of the robot.

The forward kinematics map $$g_{st}: Q \rightarrow SE(3)$$ is given by $\begin{equation}g_{s t}(\theta)=e^{\widehat{\xi}_{1} \theta_{1}} e^{\widehat{\xi}_{2} \theta_{2}} \cdots e^{\widehat{\xi}_{n} \theta_{n}} g_{s t}(0)\label{fk}\end{equation}$ where $$\xi_i$$ must be numbered sequentially starting from the base

## Manipulator Jacobian

Since $$g: \mathbb{R}^n \rightarrow SE(3)$$ is a matrix-valued function, the Jacobian $$\frac{\partial g}{\partial \theta}$$ cannot be used directly as a map between joint and end-effector velocities. Instead we derive it from the twist notation.

### End-Effector Velocity

\begin{aligned}\widehat{V}_{st}^s &= \dot{g}_{st}(\theta) g_{st}^{-1}(\theta) \\&= \sum_{i=1}^n \left( \frac{\partial g_{st}}{\partial \theta_i} \dot{\theta}_i \right) g_{st}^{-1}(\theta) \\&= \sum_{i=1}^n \left( \frac{\partial g_{st}}{\partial \theta_i} g_{st}^{-1}(\theta) \right) \dot{\theta}_i\end{aligned}

It can be written as $V_{st}^s = J_{st}^s(\theta) \dot{\theta} \\J_{st}^s(\theta) = \left[ \left(\frac{\partial g_{st}}{\partial \theta_1} g_{st}^{-1} \right)^\vee \dots \left(\frac{\partial g_{st}}{\partial \theta_n} g_{st}^{-1} \right)^\vee \right]$ where $$J_{st}^s(\theta) \in \mathbb{R}^{6\times n}$$ is called the spatial manipulator Jacobian

If we represent the forward kinematics as $$\eqref{fk}$$ then \begin{aligned}\left(\frac{\partial g_{s t}}{\partial \theta_{i}}\right) g_{s t}^{-1} &=e^{\widehat{\xi}_{1} \theta_{1}} \cdots e^{\widehat{\xi}_{i-1} \theta_{i-1}} \frac{\partial}{\partial \theta_{i}}\left(e^{\widehat{\xi}_{i} \theta_{i}}\right) e^{\widehat{\xi}_{i+1} \theta_{i+1}} \cdots e^{\widehat{\xi}_{n} \theta_{n}} g_{s t}(0) g_{s t}^{-1} \\&=e^{\widehat{\xi}_{1} \theta_{1}} \cdots e^{\widehat{\xi}_{i-1} \theta_{i-1}}\left(\widehat{\xi}_{i}\right) e^{\widehat{\xi}_{i} \theta_{i}} \cdots e^{\widehat{\xi}_{n} \theta_{n}} g_{s t}(0) g_{s t}^{-1} \\&=e^{\widehat{\xi}_{1} \theta_{1}} \cdots e^{\widehat{\xi}_{i-1} \theta_{i-1}}\left(\widehat{\xi}_{i}\right) e^{-\widehat{\xi}_{i-1} \theta_{i-1}} \cdots e^{-\widehat{\xi}_{1} \theta_{1}}\end{aligned} Converting to twist coordinates: $\left(\frac{\partial g_{s t}}{\partial \theta_{i}} g_{s t}^{-1}\right)^{\vee}=\operatorname{Ad}_{\left(e^{\hat{\xi}_{1} \theta_{1}} \dots e^{\widehat{\xi}_{i-1} \theta_{i-1}}\right)} {\xi_{i}}$ The spatial manipulator Jacobian becomes $J_{st}^s(\theta) = \begin{bmatrix}\xi_1 & \xi_2' & \dots & \xi_n'\end{bmatrix} \\\xi_i' = \operatorname{Ad}_{\left(e^{\hat{\xi}_{1} \theta_{1}} \dots e^{\widehat{\xi}_{i-1} \theta_{i-1}}\right)} {\xi_{i}}$ which means that the $$i$$-th column of the spatial Jacobian is the $$i$$-th joint twist, transformed to the current manipulator configuration.

The body manipulator Jacobian can be defined similarly: $J_{st}^b(\theta) = \begin{bmatrix}\xi_1^\dagger & \dots & \xi_{n-1}^\dagger & \xi_n^\dagger\end{bmatrix} \\\xi_i^\dagger = \operatorname{Ad}^{-1}_{\left(e^{\hat{\xi}_{i} \theta_{i}} \dots e^{\widehat{\xi}_{n} \theta_{n}} g_{st}(0)\right)} {\xi_{i}}$ The columns of $$J_{st}^b$$ correspond to the joint twists written w.r.t. the tool frame at the current configuration.

The spatial and body Jacobians are related by an adjoint transformation: $J_{st}^s(\theta) = \operatorname{Ad}_{g_{st}(\theta)} J_{st}^b(\theta)$ The manipulator Jacobian (when invertible) can be used to move a robot without calculating the inverse kinematics by $\dot{\theta}(t) = [J_{st}^s(\theta)]^{-1} V_{st}^s(t)$
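As an illustration of this resolved-rate idea, here is a sketch for a planar 2R arm using its $$2\times 2$$ position Jacobian in place of the full $$6\times n$$ spatial Jacobian (link lengths and the desired tip velocity are made-up values):

```python
import math

def fk(t1, t2, l1=1.0, l2=1.0):
    # tip position of a planar 2R arm
    return (l1*math.cos(t1) + l2*math.cos(t1 + t2),
            l1*math.sin(t1) + l2*math.sin(t1 + t2))

def jacobian(t1, t2, l1=1.0, l2=1.0):
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return ((-l1*s1 - l2*s12, -l2*s12),
            ( l1*c1 + l2*c12,  l2*c12))

def solve2(J, rhs):
    # invert the 2x2 Jacobian (valid away from singularities: det = l1 l2 sin(t2))
    (a, b), (c, d) = J
    det = a*d - b*c
    return ((d*rhs[0] - b*rhs[1]) / det, (-c*rhs[0] + a*rhs[1]) / det)

theta = [0.3, 0.8]
v_des = (0.1, 0.0)       # desired tip velocity: move in +x
dt = 1e-3

p0 = fk(*theta)
theta_dot = solve2(jacobian(*theta), v_des)   # theta_dot = J^{-1} v_des
theta = [theta[0] + dt*theta_dot[0], theta[1] + dt*theta_dot[1]]
p1 = fk(*theta)          # tip has moved by roughly dt * v_des
```

One Euler step is enough to see the tip track the commanded velocity to first order.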

### Singularities

A singular configuration of a robot manipulator is a configuration at which the manipulator Jacobian drops rank.

4 common singularity cases:

Two collinear revolute joints: There exist two revolute joints with twists $\xi_1 = \begin{bmatrix}-\omega_1 \times q_1 \\\omega_1\end{bmatrix} \qquad\xi_2 = \begin{bmatrix}-\omega_2 \times q_2 \\\omega_2\end{bmatrix}$ with the following conditions:

1. The axes are parallel: $$\omega_1 = \pm \omega_2$$
2. The axes are collinear: $$\omega_i \times(q_1 - q_2) = 0$$

Three parallel coplanar revolute joint axes

1. The axes are parallel: $$\omega_i = \pm \omega_j$$ for $$i,j=1,2,3$$
2. The axes are coplanar: there exists $$n$$ s.t. $$n^T \omega_i = 0$$ and $$n^T(q_i - q_j) = 0$$ for $$i,j=1,2,3$$

Four intersecting revolute joint axes: There exists a point $$q$$ s.t. $$\omega_i \times(q_i - q) = 0$$ for $$i=1,2,3,4$$

Four parallel joint axes: The axes are parallel: $$\omega_i = \pm\omega_j$$ for $$i=1,2,3,4$$

### Manipulability

The manipulability of a robot describes its ability to move freely in all directions in the workspace.

1. The ability to reach a certain position or set of positions
2. The ability to change the position or orientation at a given configuration

Manipulability measure:

1. Minimum singular value: $$\sigma_\min(J)$$
2. Inverse of the condition number: $$\sigma_\min(J) / \sigma_\max(J)$$
3. Determinant: $$\det J$$
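For a $$2\times 2$$ Jacobian the three measures can be computed in closed form from the eigenvalues of $$J^TJ$$; the sketch below (illustrative only) evaluates them at a fully stretched-out, singular configuration of a planar 2R arm, where all three vanish:

```python
import math

def measures(J):
    # closed-form singular values of a 2x2 J via the eigenvalues of J^T J
    (a, b), (c, d) = J
    det = a*d - b*c
    tr = a*a + b*b + c*c + d*d            # trace of J^T J
    disc = math.sqrt(max(tr*tr - 4*det*det, 0.0))
    smax = math.sqrt((tr + disc) / 2)
    smin = math.sqrt(max((tr - disc) / 2, 0.0))
    return smin, smin / smax, det         # the three measures above

# Planar 2R arm with l1 = l2 = 1 at theta2 = 0 (arm fully extended: singular)
t1 = 0.5
s1, c1 = math.sin(t1), math.cos(t1)
J_sing = ((-2*s1, -s1), (2*c1, c1))
smin, inv_cond, detJ = measures(J_sing)   # all three are (numerically) zero
```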
]]>
<p>EN530.646 Review. Based on <em><a href="https://www.cds.caltech.edu/~murray/books/MLS/pdf/mls94-complete.pdf">A Mathematical Introduction to Robotic Manipulation</a></em> book.</p>
Applied Optimal Control https://silencial.github.io/optimal-control/ 2020-12-29T00:00:00.000Z 2021-02-13T00:00:00.000Z EN530.603 Review.

# Unconstrained Optimization

## Optimality Conditions

### Necessary Optimality Conditions

$\nabla f = 0 \qquad \text{First-order Necessary Conditions} \\\nabla^2 f \succeq 0 \qquad \text{Second-order Necessary Conditions}$

### Sufficient Optimality Conditions

$\nabla f(x^*) = 0, \quad \nabla^2 f(x^*) \succ 0$

## Numerical Solution

$x^{k+1} = x^k + \alpha^k d^k, \quad k=0,1,\dots$

where $$d^k$$ is called the search direction and $$\alpha^k > 0$$ is called the stepsize. The most common methods for finding $$\alpha^k$$ and $$d^k$$ are gradient-based.

1. Choose direction $$d^k$$ so that whenever $$\nabla f(x^k) \ne 0$$ we have $\nabla f(x^k)^T d^k < 0$

2. Choose stepsize $$\alpha^k > 0$$ so that $f(x^k + \alpha d^k) < f(x^k)$

### Search Direction

Many gradient methods are specified in the form $x^{k+1} = x^k - \alpha^k D^k \nabla f(x^k)$ where $$D^k \in \mathbb{S}^n_{++}$$

Steepest Descent: $D^k = I$ Newton's Method: $D^k = [\nabla^2 f(x^k)]^{-1}$ Gauss-Newton Method: When the cost has a special least squares form $f(x) = \frac{1}{2} \|g(x)\|^2$ we can choose $D^k = \left[ \nabla g(x^k) \nabla g(x^k)^T \right]^{-1}$ Conjugate-Gradient Method: Choose linearly independent (conjugate) search directions $$d^k$$ $d^k = -\nabla f(x^k) + \beta^k d^{k-1}$ The most common way to compute $$\beta^k$$ is $\beta^{k}=\frac{\nabla f\left(x^{k}\right)^{T}\left(\nabla f\left(x^{k}\right)-\nabla f\left(x^{k-1}\right)\right)}{\nabla f\left(x^{k-1}\right)^{T} \nabla f\left(x^{k-1}\right)}$

### Stepsize

Minimization Rule: Choose $$\alpha^k \in [0,s]$$ so that $$f$$ is minimized $f(x^k + \alpha^k d^k) = \min_{\alpha \in [0, s]} f(x^k + \alpha d^k)$ Successive Stepsize Reduction - Armijo Rule:

1. Choose $$s>0, 0<\beta<1, 0<\sigma<1$$
2. Increase: $$m = 0, 1, \dots$$
3. Until: $$f(x^k) - f(x^k+\beta^m s d^k) \ge -\sigma \beta^m s \nabla f(x^k)^T d^k$$

where $$\beta$$ is the rate of decrease and $$\sigma$$ is the acceptance ratio
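A minimal sketch of steepest descent with the Armijo rule on the quadratic $$f(x,y) = x^2 + 5y^2$$ (the values of $$s$$, $$\beta$$, $$\sigma$$ are typical choices, not prescribed here):

```python
def f(x):
    return x[0]**2 + 5*x[1]**2

def grad(x):
    return (2*x[0], 10*x[1])

def armijo_step(x, s=1.0, beta=0.5, sigma=0.1):
    g = grad(x)
    d = (-g[0], -g[1])                  # steepest-descent direction
    gTd = g[0]*d[0] + g[1]*d[1]         # < 0 whenever grad f != 0
    alpha = s
    # shrink the stepsize until f(x) - f(x + alpha d) >= -sigma alpha grad^T d
    while f((x[0] + alpha*d[0], x[1] + alpha*d[1])) > f(x) + sigma*alpha*gTd:
        alpha *= beta
    return (x[0] + alpha*d[0], x[1] + alpha*d[1])

x = (1.0, 1.0)
for _ in range(400):
    x = armijo_step(x)
# x is now close to the minimizer (0, 0)
```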

### Regularized Newton Method

Drawbacks of pure Newton's method:

• The inverse Hessian $$[\nabla^2 f(x)]^{-1}$$ might not be computable
• When $$\nabla^2 f(x) \nsucc 0$$ the method can be attracted to maxima or saddle points

Add a regularizing term to the Hessian and solve the system $(\nabla^2 f(x^k) + \Delta^k) d^k = -\nabla f(x^k)$ where $$\Delta^k$$ is chosen so that $$\nabla^2 f(x^k) + \Delta^k \succ 0$$

In trust-region methods one sets $\Delta^k = \delta^k I$ and $$(\nabla^2 f(x^k) + \delta^k I) d^k = -\nabla f(x^k)$$ is equivalent to $\DeclareMathOperator*{\argmin}{\arg\min}\DeclareMathOperator*{\argmax}{\arg\max}d^k \in \argmin_{\|d\| \le \gamma^k} f^k(d)$

# Constrained Optimization

## Equality Constraints

Minimize $$L(x): \mathbb{R}^n \rightarrow \mathbb{R}$$, subject to $$k$$ constraints $$f(x) = 0$$.

### Necessary Optimality Conditions

Let $$H(x, \lambda) = L(x) + \lambda^T f(x)$$ be the Hamiltonian, where the scalars $$\lambda$$ are called the Lagrange multipliers $\nabla_x H(x^*, \lambda^*) = 0 \qquad \text{First-order Necessary Conditions} \\dx^T [\nabla_{xx} H(x^*, \lambda^*)] dx \ge 0 \qquad \text{Second-order Necessary Conditions}$ Note that $$dx$$ is not arbitrary, i.e. we require $$\nabla f_i(x^*)^T dx = 0$$

Geometric meaning: any feasible $$dx$$ must be orthogonal to the gradients $$\nabla f_i$$. At an optimum $$x^*$$ we must also have $$\nabla L (x^*)^T dx = 0$$, so that $$\nabla L(x^*)$$ must be spanned by the gradients: $\nabla L(x^*) = \sum_{i=1}^k \lambda_i \nabla f_i(x^*)$

### Sufficient Optimality Conditions

$\nabla_x H = 0, \quad dx^T [\nabla_{xx} H] dx > 0$

## Inequality Constraints

Minimize $$L(x)$$ subject to $$f(x) \le 0$$.

Now the Lagrangian multipliers have to satisfy $\lambda = \begin{cases}\ge 0, \quad f(x)=0 \\=0, \quad f(x) < 0\end{cases}$

Geometric meaning: the gradient of $$L$$ w.r.t. $$x$$ at a minimum must point such that $$L$$ can only decrease by violating the constraints: $-\nabla L = \sum_{i=1}^k \lambda_i \nabla f_i \qquad (\lambda_i \ge 0)$

# Trajectory Optimization

Solving the optimal control problems $\begin{array}{ll}\text{minimize} &J(x(t), u(t), t_f) = \displaystyle\int_{t_0}^{t_f} L(x(t), u(t), t) dt \\\text{subject to} &\dot{x}= f(x,u,t) \\&\text{other constraints}\end{array}$ The cost $$J$$ is called a functional, i.e. a function of functions. $\text{differential of a function}\quad \nabla g = 0 \quad \Longleftrightarrow \quad \text{variation of a functional}\quad \delta J = 0$

## Euler-Lagrange Equation

Consider $J = \int_{t_0}^{t_f} g(x(t), \dot{x}(t)) dt$ given $$x(t_0)$$

The variation is $\delta J=\int_{t_{0}}^{t_{f}}\left[g_{x}(x, \dot{x}) \delta x-\frac{d}{d t} g_{\dot{x}}(x, \dot{x}) \delta x\right] dt+g_{\dot{x}}(x(t_{f}), \dot{x}(t_{f})) \delta x(t_{f})$ Fixed boundary conditions: if $$x(t_f)$$ is given, $$\delta J=0$$ is equivalent to $g_x(x, \dot{x}) - \frac{d}{dt}g_{\dot{x}}(x, \dot{x}) = 0$ Free boundary conditions: if $$x(t_f)$$ is not fixed, $$\delta J=0$$ is equivalent to $g_x(x, \dot{x}) - \frac{d}{dt}g_{\dot{x}}(x, \dot{x}) = 0 \\g_{\dot{x}}(x(t_f), \dot{x}(t_f)) = 0$
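A classical worked example for the fixed-boundary case is the arc-length functional, whose extremals are straight lines:

```latex
g(x, \dot{x}) = \sqrt{1 + \dot{x}^2}, \qquad
g_x = 0, \qquad
g_{\dot{x}} = \frac{\dot{x}}{\sqrt{1 + \dot{x}^2}}
```

so the Euler-Lagrange equation reduces to $$\frac{d}{dt} \frac{\dot{x}}{\sqrt{1+\dot{x}^2}} = 0$$, i.e. $$\dot{x}$$ is constant and the extremal between two fixed endpoints is the straight line joining them.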

## Free Final-Time

When $$t_f$$ is allowed to vary \begin{aligned}\delta J=& \int_{t_{0}}^{t_{f}}\left[g_{x}(x^{*}, \dot{x}^{*}, t) \delta x-\frac{d}{d t} g_{\dot{x}}(x^{*}, \dot{x}^{*}, t) \delta x\right] d t \\&+g_{\dot{x}}(x^{*}(t_{f}), \dot{x}^{*}(t_{f}), t) \delta x(t_{f})+g(x^{*}(t_{f}), \dot{x}^{*}(t_{f}), t_{f}) \delta t_{f} \\=& \int_{t_{0}}^{t_{f}}\left[g_{x}(x^{*}, \dot{x}^{*}, t) \delta x-\frac{d}{d t} g_{\dot{x}}(x^{*}, \dot{x}^{*}, t) \delta x\right] d t \\&+\left(g_{\dot{x}}(x^{*}, \dot{x}^{*}, t) \delta x_{f}+ [g(x^{*}, \dot{x}^{*}, t_{f}) - g_{\dot{x}}(x^{*}, \dot{x}^{*}, t_{f})\dot{x}] \delta t_{f}\right)_{t=t_f} \\\end{aligned}

where $$\delta x_f$$ is the total space-time variation defined as $\delta x_f = \delta x(t_f) + \dot{x}(t_f)\delta t_f$ Unrelated $$t_f$$ and $$x(t_f)$$: the additional necessary conditions are $g_{\dot{x}}(t_f) = 0 \\g(t_f) = 0$ Function related: when $$x(t_f) = \Theta(t_f)$$, then $$\delta x_f = \frac{d\Theta}{dt} \delta t_f$$, the additional necessary conditions becomes $g_{\dot{x}}(t_f) \left[\frac{d\Theta}{dt} - \dot{x}^*\right]_{t=t_f} + g(t_f) = 0$

## Differential Constraints

$\begin{array}{cl}\text{minimize} & \displaystyle J = \int_{t_0}^{t_f} g(x, \dot{x}, t) dt \\\text{subject to} & f(x(t), \dot{x}(t), t) = 0\end{array}$

where $$t_f$$ and $$x(t_f)$$ are fixed. Define the Lagrangian multipliers $$\lambda : [t_0, t_f] \rightarrow \mathbb{R}^n$$ and the augmented cost $J_a = \int_{t_0}^{t_f} g_a dt \\g_a = g + \lambda^T f$ The necessary optimality conditions are the Euler-Lagrange equations w.r.t. $$g_a$$

## General Boundary Constraints

$\begin{array}{cl}\text{minimize} & \displaystyle J = \varphi(x(t_f), t_f) + \int_{t_0}^{t_f} g(x, \dot{x}, t) dt \\\text{subject to} & \psi(x(t_f), t_f) = 0\end{array}$

where $$t_f$$ is free. Define the augmented cost $J_a = \omega(x(t_f), \nu, t_f) + \int_{t_0}^{t_f} g(x, \dot{x}, t) dt \\\omega(x(t_f), \nu, t_f) = \varphi(x(t_f), t_f) + \nu^T \psi(x(t_f), t_f)$ The necessary optimality conditions are \begin{aligned}&\nabla_x \omega(x(t_f), \nu, t_f) + \nabla_{\dot{x}} g(x(t_f), \dot{x}(t_f), t_f) = 0 \qquad (\delta x_f) \\&\frac{\partial}{\partial t_f} \omega(x(t_f), \nu, t_f) + g(x(t_f), \dot{x}(t_f), t_f) - \nabla_{\dot{x}} g(x(t_f), \dot{x}(t_f), t_f)^T \dot{x}(t_f) = 0 \qquad (\delta t_f) \\&\psi(x(t_f), t_f) = 0 \qquad (\text{constraints}) \\&\nabla_x g(x,\dot{x}, t) - \frac{d}{dt}\nabla_{\dot{x}}g(x,\dot{x},t) = 0, \quad t\in (t_0, t_f) \qquad (\delta x(t))\end{aligned}

# Continuous Optimal Control

$\begin{array}{cl}\text{minimize} & \displaystyle J = \varphi(x(t_f), t_f) + \int_{t_0}^{t_f} L(x, u, t) dt \\\text{subject to} & \psi(x(t_f), t_f) = 0 \\& \dot{x} = f(x, u, t)\end{array}$

Define the augmented cost $J_a = \varphi(t_f) + \nu^T \psi(t_f) + \int_{t_0}^{t_f} [H(x,u,\lambda,t) -\lambda^T \dot{x}] dt \\H = L(x,u,t) + \lambda^T(t)f(x,u,t)$ The necessary optimality conditions are \text{Euler-Lagrange:} \qquad \begin{aligned}&\dot{x} = f(x,u,t) \qquad (\text{dynamics}) \\&\dot{\lambda} = -\nabla_x H \qquad (\delta x(t)) \\\end{aligned}

$\text{Control Optimization:}\qquad \nabla_u H = 0$

\text{Transversality Conditions (TC):}\qquad \begin{aligned}&\psi(x(t_f), t_f) = 0 \qquad(\text{boundary constraint})\\&\lambda(t_f) = \nabla_x \varphi(t_f) + \nabla_x \psi(t_f)\cdot \nu \qquad (\delta x_f) \\&(\partial_t \varphi + \nu^T \partial_t \psi + L + \lambda^T f)_{t=t_f} = 0 \qquad (\delta t_f)\end{aligned}

## Hamiltonian Conservation

When the Hamiltonian does not depend on time ($$f$$ and $$L$$ do not depend on time): $\partial_t H(x,u,\lambda, t) = 0$ then $$H$$ is a conserved quantity along optimal trajectories \begin{aligned}\dot{H}(x,u,\lambda,t) &= \partial_x H \cdot \dot{x} + \partial_u H \cdot \dot{u} + \partial_\lambda H \cdot \dot{\lambda} + \partial_t H \\&= -\dot{\lambda}^T f(x,u,t) + 0 + f(x,u,t)^T \dot{\lambda} + 0 \\&= 0\end{aligned}

# Linear-Quadratic Regulator

Dynamics is linear and cost function is quadratic $\dot{x} = Ax + Bu \\J = \frac{1}{2}x^T(t_f) P_f x(t_f) + \int_{t_0}^{t_f} \frac{1}{2} [x(t)^T Q(t) x(t) + u(t)^T R(t) u (t)] dt$ where $$P_f, Q \in \mathbb{S}_{+}$$ and $$R\in\mathbb{S}_{++}$$

Using the optimality conditions to get $\begin{pmatrix}\dot{x} \\ \dot{\lambda}\end{pmatrix} = \begin{pmatrix}A & -BR^{-1}B^T \\-Q & -A^T\end{pmatrix} \begin{pmatrix}x \\ \lambda\end{pmatrix} \\u = -R^{-1}B^T\lambda \\\lambda(t_f) = P_f x(t_f) \\$ Kalman showed that $$\lambda(t)$$ are linear functions of the states $$\lambda(t) = P(t) x(t)$$, then we have the Riccati ODE $\dot{P} = -A^T P - PA + PBR^{-1}B^TP - Q, \qquad P(t_f) = P_f$ and the control is $u(t) = -R^{-1}B^TP(t)x(t) = -K(t) x(t)$
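The Riccati ODE is integrated backward from $$P(t_f) = P_f$$; a scalar sketch with made-up numbers follows (for a long horizon $$P$$ approaches the steady-state root of $$0 = -2aP + b^2P^2/r - q$$):

```python
# Backward Euler integration of the scalar Riccati ODE
#   Pdot = -2 a P + (b^2 / r) P^2 - q,   P(tf) = Pf
# for xdot = a x + b u; a, b, q, r, Pf and the horizon are made-up numbers.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
Pf, tf, dt = 0.0, 10.0, 1e-3

P, t = Pf, tf
while t > 0:
    Pdot = -2*a*P + (b*b/r)*P*P - q
    P -= dt * Pdot        # step backward in time
    t -= dt

K = (b / r) * P           # time-varying gain u = -K x, evaluated here at t0
# For these numbers P approaches 1 + sqrt(2), the positive root of P^2 - 2P - 1 = 0
```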

## Optimal Cost

\begin{aligned}J(t) &=\frac{1}{2} x^{T}\left(t_{f}\right) P_{f} x\left(t_{f}\right)+\int_{t}^{t_{f}} \frac{1}{2}\left[x^{T} Q x+u^{T} R u\right] \mathrm{d} t \\&=\frac{1}{2} x^{T}\left(t_{f}\right) P_{f} x\left(t_{f}\right)+\int_{t}^{t_{f}} \frac{1}{2}\left[x(t)^{T}\left(Q+K^{T} R K\right) x\right] \mathrm{d} t \\&=\frac{1}{2} x^{T}\left(t_{f}\right) P_{f} x\left(t_{f}\right)-\int_{t}^{t_{f}} \frac{d}{d t}\left(\frac{1}{2} x^{T} P x\right) \mathrm{d} t \\&=\frac{1}{2} x^{T}(t) P(t) x(t)\end{aligned}

## Trajectory Tracking

Consider the problem of not stabilizing to the origin, i.e. $$x\rightarrow 0$$, but tracking a given reference trajectory $$x_d(t)$$, i.e. $$x\rightarrow x_d$$.

Define the error state and control error $e = x - x_d \\v = u - u_d$ then \begin{aligned}\dot{e} &= \dot{x} - \dot{x}_d \\&= f(x_d+e, u_d+v) - f(x_d, u_d) \\&\approx \left.\frac{\partial f}{\partial x}\right|_{(x_d, u_d)} e + \left.\frac{\partial f}{\partial u}\right|_{(x_d, u_d)} v \\&= Ae + Bv\end{aligned} Use LQR to solve $$v = -K e$$ and the control is $u = -K(x - x_d) + u_d$

# Constrained Optimal Control

Consider optimal control problems subject to constraints $$|u(t)| \le 1$$

## Pontryagin's Minimum Principle

The optimal control must minimize the Hamiltonian: $H(x^*(t), u^*(t) + \delta u(t), \lambda^*(t), t) \ge H(x^*(t), u^*(t), \lambda^*(t), t)$

## Minimum-Time Problems

Consider a linear system $\dot{x} = Ax + Bu, \qquad x(0) = x_0$ with a single control $$u$$ constrained by $$|u(t)| \le 1$$ and required to reach the origin $$x(t_f) = 0$$ in minimum time. The cost function is then $J = \int_{t_0}^{t_f} 1 \, dt$ So the Hamiltonian is $$H = 1+ \lambda^T(Ax + Bu)$$. According to the minimum principle, we have $u^* = +1, \quad \text{if } \lambda^TB < 0 \\u^* = -1, \quad \text{if } \lambda^TB > 0$ where $$\lambda^TB$$ is called the switching function. This is an example of bang-bang control since the control is always at its maximum or minimum.

In addition, to ensure smoothness in the trajectory during transitions we need $\lambda(t^-) = \lambda(t^+) \\H(t^-) = H(t^+)$ where $$t$$ is the time of transition. These conditions are called Weierstrass-Erdmann conditions

## Minimum Control Effort Problems

Consider a nonlinear system with affine controls defined by $\dot{x} = a(x(t),t) + B(x(t), t) u(t)$ where $$B$$ is an $$n\times m$$ matrix and $$|u_i(t)| \le 1$$ for $$i=1,\dots,m$$. The cost function is $J = \int_{t_0}^{t_f} \left(\sum_{i=1}^m |u_i(t)|\right) dt$ Express $$B$$ as $$[b_1 \,|\, b_2 \,|\, \cdots \,|\, b_m]$$ and assume the components of $$u$$ are independent of one another, so from the minimum principle we have $u_i^* = \begin{cases}1, & \text{for } \lambda^{*T}b_i < -1 \\0, & \text{for } -1 < \lambda^{*T}b_i < 1 \\-1, & \text{for } 1 < \lambda^{*T}b_i \\\ge 0, & \text{for } \lambda^{*T}b_i = -1 \\\le 0, & \text{for } \lambda^{*T}b_i = 1\end{cases}$

## Singular Controls

Singular controls are controls that cannot be directly determined by either the optimality conditions or the minimum principle.

In the unconstrained case this generally occurs when $$\nabla_u H = 0$$ cannot be solved for $$u$$, which happens when $$\nabla_u^2 H = 0$$, i.e. when the Hamiltonian is not strictly convex in $$u$$.

Consider the LQR setting where $$R = 0$$, then the first order condition $\nabla_u H = B^T \lambda = 0$ does not provide information about $$u$$. The solution is to consider higher-order derivatives of $$\nabla_u H$$ until $$u$$ appears explicitly: $\frac{d}{dt} \nabla_u H = B^T \dot{\lambda} = -B^T(Qx + A^T \lambda) = 0 \\\frac{d^2}{dt^2} \nabla_u H = -B^T(Q\dot{x} + A^T \dot{\lambda}) = -B^T(Q(Ax + Bu) - A^T (Qx + A^T\lambda)) = 0$ which now provides enough information to obtain the singular control $$u$$

## General Constraints

Now consider general constraints $$c(x,u,t) \le 0$$. The Hamiltonian is defined by $H = L + \lambda^T f + \mu^T c, \qquad \text{where } \begin{cases}\mu \ge 0, \quad \text{if } c = 0 \\\mu = 0, \quad \text{if } c < 0\end{cases}$ The adjoint equations are $\dot{\lambda} = \begin{cases}-\nabla_x L - \nabla_x f^T \lambda, & c<0 \\-\nabla_x L - \nabla_x f^T \lambda - \nabla_x c^T \mu, & c=0\end{cases}$ The control is found by setting $$\nabla_u H = 0$$: $0 = \begin{cases}\nabla_u L + \nabla_u f^T \lambda, & c<0 \\\nabla_u L + \nabla_u f^T \lambda + \nabla_u c^T \mu, & c=0\end{cases}$

### Inequality Constraints on the State Only

Constraints of the form $$c(x(t), t) \le 0$$ are more difficult to handle and require differentiation until the control appears explicitly.

In general, if the constraint is differentiated $$q$$ times until $$u$$ shows up, then the Hamiltonian is defined as $H = L + \lambda^T f + \mu^T c^{(q)}$ In addition, the tangency constraints are enforced at the time $$t$$ when $$c(x,t)$$ becomes active. $\begin{bmatrix}c(x,t) \\\dot{c}(x,t) \\\vdots \\c^{(q-1)}(x,t)\end{bmatrix} = 0$

# Dynamic Programming

## Discrete-Time DP

Consider a discrete optimization problem $\begin{array}{ll}\text{minimize} &J = \phi(x_N, t_N) + \sum_{i=0}^{N-1} L_i(x_i, u_i) \\\text{subject to} & x_{i+1} = f_i(x_i, u_i)\end{array}$ with given $$x(t_0)$$, $$t_0$$, $$t_N$$.

Define the cost from discrete stage $$i$$ to $$N$$ by $J_i \triangleq \phi(x_N, t_N) + \sum_{k=i}^{N-1} L_k(x_k, u_k)$ and the optimal cost-to-go function (or optimal value function) at stage $$i$$ as $V_i(x) \triangleq \min_{u_{i:N-1}} J_i$ Bellman equation: $V_i(x) = \min_u [L_i(x, u) + V_{i+1}(f_i(x, u))]$ with $$V_N(x) = \phi(x, t_N)$$
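The backward recursion of the Bellman equation can be sketched on a toy deterministic problem (states, controls and costs below are all made up):

```python
# Backward value iteration V_i(x) = min_u [L_i(x, u) + V_{i+1}(x')] on a toy
# chain: states 0..4, controls u in {-1, 0, 1}, x' = clip(x + u), stage cost
# |u| + 1, terminal cost 0 at the goal and 100 elsewhere.
N = 6
GOAL = 4

def step(x, u):
    return min(max(x + u, 0), 4)

V = [0.0 if x == GOAL else 100.0 for x in range(5)]      # V_N = phi
for _ in range(N):
    V = [min(abs(u) + 1 + V[step(x, u)] for u in (-1, 0, 1))
         for x in range(5)]
# From x = 0 the optimal plan is 4 moves (cost 2 each) plus 2 waits (cost 1 each)
```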

### Discrete-Time LQR

$x_{i+1} = A_i x_i + B_i u_i \\\phi(x) = \frac{1}{2} x^T P_f x, \qquad L_i(x,u) = \frac{1}{2}(x^TQ_ix + u^TR_iu)$

Assume the value function is of the form $V_i(x) = \frac{1}{2}x^TP_ix$ for $$P_i \in \mathbb{S}_{++}$$ with boundary condition $$P_N = P_f$$. Using Bellman's principle we can get $u^{*}=-\left(R_{i}+B_{i}^{T} P_{i+1} B_{i}\right)^{-1} B_{i}^{T} P_{i+1} A_{i} x \equiv -K_{i} x$ Substituting $$u^*$$ back into the Bellman equation to obtain $x^{T} P_{i} x=x^{T}\left[K_{i}^{T} R_{i} K_{i}+Q_{i}+\left(A_{i}-B_{i} K_{i}\right)^{T} P_{i+1}\left(A_{i}-B_{i} K_{i}\right)\right] x$ This relationship can be cycled backward starting from $$P_N$$ to $$P_0$$.

It can also be expressed without gains $$K_i$$ according to $P_{i}=Q_{i}+A_{i}^{T}\left[P_{i+1}-P_{i+1} B_{i}\left(R_{i}+B_{i}^{T} P_{i+1} B_{i}\right)^{-1} B_{i}^{T} P_{i+1}\right] A_{i}$
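The gain-free recursion is easy to cycle backward; a scalar sketch with made-up numbers (for constant $$A, B, Q, R$$ it converges to the steady-state solution of the discrete algebraic Riccati equation):

```python
# Scalar discrete-time Riccati recursion (the gain-free form), cycled
# backward from P_N = Pf; A, B, Q, R, Pf and N are made-up numbers.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0
Pf, N = 1.0, 50

P = Pf
for _ in range(N):
    P = Q + A * (P - P * B / (R + B * P * B) * B * P) * A
K = B * P * A / (R + B * P * B)       # steady-state feedback gain

# With these numbers P converges to the golden ratio (1 + sqrt(5)) / 2
```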

## Continuous DP

Define the continuous value function $$V(x,t)$$ as $V(x(t), t)=\min _{u(t), t \in\left[t, t_{f}\right]}\left[\phi\left(x\left(t_{f}\right), t_{f}\right)+\int_{t}^{t_{f}} L(x(\tau), u(\tau), \tau) d \tau\right]$ Bellman equation can be expressed as $V(x(t), t)=\min_{u(t), t \in[t, t+\Delta]}\left[\int_{t}^{t+\Delta} L(x(\tau), u(\tau), \tau) \mathrm{d} \tau+V(x(t+\Delta), t+\Delta)\right]$ Expand $$V(x(t+\Delta), t+\Delta)$$ according to $V(x(t+\Delta), t+\Delta) = V(x(t), t) + [\partial_t V(x(t), t) + \nabla_x V(x(t), t)^T \dot{x}] \Delta + o(\Delta)$ Substituting back into Bellman equation we can get the Hamilton-Jacobi-Bellman equation (HJB): $-\partial_{t} V(x, t)=\min_{u(t)}\left[L(x, u, t)+\nabla_{x} V(x, t)^{T} f(x, u, t)\right]$

# Numerical Methods for Optimal Control

General optimal control problem: \begin{aligned}\text{minimize}\quad &J=\phi\left(x(t_{f}), t_{f}\right)+\int_{t_{0}}^{t_{f}} L(x(t), u(t), t) d t \\\text{subject to}\quad &x\left(t_{0}\right)=x_{0}, \quad x\left(t_{f}\right) \text{ and } t_{f} \text{ free} \\&\dot{x}(t)=f(x(t), u(t), t) \\&c(x(t), u(t), t) \leq 0, \text { for all } t \in\left[t_{0}, t_{f}\right] \\&\psi\left(x\left(t_{f}\right), t_{f}\right) \leq 0\end{aligned}

Consider three families of numerical methods for solving this:

• dynamic programming: i.e. solution to the HJB equations
• indirect methods: based on calculus of variations and Pontryagin's principle
• direct methods: based on a finite-dimensional representation

## Indirect Methods

Start with the necessary conditions (ignore the path constraints): \begin{aligned}&\dot{x}=f(x, u, t) \\&\dot{\lambda}=-\nabla_{x} H(x, u, \lambda, t) \\&u^{*}=\arg\min_{u} H(x, u, \lambda, t) \\&\lambda\left(t_{f}\right)=\nabla_{x} \phi\left(x\left(t_{f}\right), t_{f}\right)+\nabla_{x} \psi\left(x\left(t_{f}\right), t_{f}\right)^{T} \nu \\&\left(\partial_{t} \phi+\nu^{T} \partial_{t} \psi+H\right)_{t=t_{f}}=0\end{aligned}

### Indirect Shooting

Shooting methods are based on integrating the EL equations forward from time $$t_0$$ to time $$t_f$$ using a starting guess and then satisfying the boundary and transversality conditions at time $$t_f$$ by formulating a root-finding problem solved by e.g. Newton-type method.

The optimization variables are $\lambda(t_0), \nu, t_f$ and the equations to be solved are $\left[\begin{array}{c}\psi\left(x\left(t_{f}\right), t_{f}\right) \\\lambda\left(t_{f}\right)-\left[\nabla_{x} \phi\left(x\left(t_{f}\right), t_{f}\right)+\nabla_{x} \psi\left(x\left(t_{f}\right), t_{f}\right)^{T} \nu\right] \\\left(\partial_{t} \phi+\nu^{T} \partial_{t} \psi+H\right)_{t=t_{f}}\end{array}\right]=0$

### Indirect Multiple-Shooting

Split the time-interval into segments and use indirect shooting.

First choose discrete times $$[t_0, t_1, \dots, t_N]$$ and let $$v_i=(x(t_i), \lambda(t_i))$$. Let $$\bar{v}_i$$ be the result of integrating the EL equations from $$v_i$$ over $$[t_i, t_{i+1}]$$. The optimization variables are $\lambda(t_0), v_1, v_2, \dots, v_N, \nu, t_f$ and the equations are $\left[\begin{array}{c}\bar{v}_{0}-v_{1} \\\vdots \\\bar{v}_{N-1}-v_{N} \\\psi\left(x\left(t_{f}\right), t_{f}\right) \\\lambda\left(t_{f}\right)-\left[\nabla_{x} \phi\left(x\left(t_{f}\right), t_{f}\right)+\nabla_{x} \psi\left(x\left(t_{f}\right), t_{f}\right)^{T} \nu\right] \\\left(\partial_{t} \phi+\nu^{T} \partial_{t} \psi+H\right)_{t=t_{f}}\end{array}\right]=0$

## Direct Methods

Start by either discretizing time and solving a discrete-time optimal control problem, or by parametrizing the controls using a finite set of parameters. Either way the result is a nonlinear programming (NLP) problem.

### Direct Shooting

Based on parametrizing the control $$u(t)$$ by a finite number of parameters $$p_k$$: $u(t) = \sum_{k=1}^M p_k B_k(t)$ where $$B_k(t)$$ are a set of basis functions.

The NLP variables are $$[p, t_f]$$ and the state can be obtained by forward integration of the dynamics.

### Direct Multiple-Shooting

Choose discrete times $$[t_0, t_1, \dots, t_N]$$, a discrete trajectory $$[x_0, x_1, \dots, x_N]$$ and a discrete set of parametrized controls $$[u(p_0), u(p_1), \dots, u(p_{N-1})]$$. Then perform a direct shooting step on each interval $$[t_i, t_{i+1}]$$ by integrating the dynamics from $$x_i$$ and obtaining $$\bar{x}_i$$

The NLP variables are $$p_0, \dots, p_{N-1}, x_1,\dots,x_N, t_f$$ and the equations to be solved are $\left[\begin{array}{c}\bar{x}_{0}-x_{1} \\\vdots \\\bar{x}_{N-1}-x_{N} \\\psi\left(x_{N}, t_{N}\right)\end{array}\right]=0$

# Optimal State Estimation

Linear model, Gaussian noise:

• Kalman filter (KF): linear discrete-time dynamics, linear measurement model, Gaussian noise
• Kalman-Bucy filter: same as KF but continuous-time dynamics

Nonlinear model, Gaussian noise:

• Extended Kalman filter (EKF): linearize dynamics and sensor model and apply KF
• Unscented Kalman filter (UKF): linearization is avoided during uncertainty propagation by propagating the principle axes of the uncertainty ellipsoid through the nonlinear dynamics and then reconstructing the updated ellipsoid; measurements are still processed through linearization

Nonlinear model, arbitrary noise:

• Particle filter (PF): states are approximated using weighted samples whose weights are updated by obtained measurements

## Least-Squares for Static Estimation

Consider the sensor model with $$x\in\mathbb{R}^n$$ and $$z\in\mathbb{R}^k$$ $z = Hx + v$ where $$v$$ is a random variable denoting the measurement error. The goal is to find the optimal estimate $$\hat{x}$$, which can be accomplished by first defining the measurement estimate error $$e_z = z - H \hat{x}$$ and minimizing $J = \frac{1}{2}e_z^T e_z$

• Necessary condition: $$\hat{x} = (H^TH)^{-1}H^Tz$$
• The sufficient condition: $$\nabla^2J = H^TH \succ 0$$. This is satisfied when $$\operatorname{rank}(H) = n$$
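As a sanity check, the estimate can be computed directly with NumPy (a minimal sketch; $$H$$ and the data below are invented for illustration):

```python
import numpy as np

# Sensor model z = H x + v with n = 2 states and k = 3 measurements.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([2.0, -1.0])
z = H @ x_true  # noise-free measurements, to verify the formula

# Necessary condition: x_hat = (H^T H)^{-1} H^T z
x_hat = np.linalg.solve(H.T @ H, H.T @ z)

# Sufficient condition holds since rank(H) = n, so H^T H is positive definite
assert np.linalg.matrix_rank(H) == H.shape[1]
print(x_hat)  # recovers x_true exactly in the noise-free case
```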

Now assume there is some prior information about $$x$$ and $$v$$ $\mathbb{E}[x] = \hat{x}_0, \qquad \mathbb{E}[(x - \hat{x}_0)(x - \hat{x}_0)^T] = P_0 \\\mathbb{E}[v] = 0, \qquad \mathbb{E}[vv^T] = R$ where $$R$$ is a diagonal matrix. Then a meaningful cost function can be defined as $J(\hat{x})=\frac{1}{2}\left(\hat{x}-\hat{x}_{0}\right)^{T} P_{0}^{-1}\left(\hat{x}-\hat{x}_{0}\right)+\frac{1}{2}(z-H \hat{x})^{T} R^{-1}(z-H \hat{x})$

• Necessary condition: $$\hat{x}=\left(H^{T} R^{-1} H+P_{0}^{-1}\right)^{-1}\left(H^{T} R^{-1} z+P_{0}^{-1} \hat{x}_{0}\right)$$
• The sufficient condition: $$\nabla^2J = P_0^{-1} + H^TR^{-1}H \succ 0$$

The necessary condition can be expressed as \begin{aligned}\hat{x} &= \hat{x}_0 + (P_0^{-1} + H^TR^{-1}H)^{-1} H^T R^{-1} (z - H \hat{x}_0) \\&\triangleq \hat{x}_0 + PH^TR^{-1} (z - H \hat{x}_0) \\&\triangleq \hat{x}_0 + K (z - H \hat{x}_0)\end{aligned} The matrix $$P$$ is actually the covariance matrix of the error in the estimate $$\hat{x}$$, i.e. $$P=\mathbb{E}[(\hat{x} - x)(\hat{x}- x)^T]$$.

The recursive least-squares algorithm:

1. Given prior: mean $$\hat{x}_0$$ and covariance $$P_0$$
2. For each new measurement $$z_i = H_ix+v_i$$, where $$v_i \sim \mathcal{N}(0, R_i)$$
1. $$\hat{x}_{i} =\hat{x}_{i-1}+K_{i}\left(z_{i}-H_{i} \hat{x}_{i-1}\right)$$
2. $$P_{i} =P_{i-1}-P_{i-1} H_{i}^{T}\left(H_{i} P_{i-1} H_{i}^{T}+R_{i}\right)^{-1} H_{i} P_{i-1}$$
3. $$K_{i} =P_{i} H_{i}^{T} R_{i}^{-1}$$

Here $$P_i$$ is updated using the matrix inversion lemma: \begin{aligned}P_i &= \left(P_{i-1}^{-1}+H_{i}^{T} R_{i}^{-1} H_{i}\right)^{-1} \\&= P_{i-1}-P_{i-1} H_{i}^{T}\left(H_{i} P_{i-1} H_{i}^{T}+R_{i}\right)^{-1} H_{i} P_{i-1}\end{aligned}
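A minimal NumPy sketch of the recursive update (the prior and the two scalar measurements are invented for illustration); processing measurements one at a time matches the batch solution:

```python
import numpy as np

def rls_update(x_hat, P, z, H, R):
    """One recursive least-squares step; returns updated mean and covariance."""
    # P_i = P_{i-1} - P_{i-1} H^T (H P_{i-1} H^T + R)^{-1} H P_{i-1}
    S = H @ P @ H.T + R
    P_new = P - P @ H.T @ np.linalg.solve(S, H @ P)
    # K_i = P_i H_i^T R_i^{-1}
    K = P_new @ H.T @ np.linalg.inv(R)
    # x_i = x_{i-1} + K_i (z_i - H_i x_{i-1})
    x_new = x_hat + K @ (z - H @ x_hat)
    return x_new, P_new

# Prior and two scalar measurements of a 1-D state (numbers are illustrative).
x_hat, P = np.array([0.0]), np.array([[1.0]])
meas = [(np.array([1.0]), np.array([[1.0]]), np.array([[0.5]])),
        (np.array([3.0]), np.array([[2.0]]), np.array([[1.0]]))]
for z, H, R in meas:
    x_hat, P = rls_update(x_hat, P, z, H, R)
print(x_hat, P)  # matches the batch formula: x = 8/7, P = 1/7
```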

## Propagation of Uncertainty

Consider the restricted class of systems $\dot{x}(t) = f(x(t), u(t)) + L(t)w(t)$ where $$L(t)$$ is a given matrix and where the noise $$w(t)$$ evolves in continuous time. Assume $$w(t)$$ is uncorrelated in time, i.e. $\mathbb{E}[w(t)w(\tau)^T] = Q_c'(t) \delta(t - \tau)$

### Linear System

Consider linear systems with Gaussian noise $x_k = \Phi_{k-1}x_{k-1} + \Gamma_{k-1}u_{k-1} + \Lambda_{k-1}w_{k-1}$ where $$\mathbb{E}[w_k]=0$$, $$\mathbb{E}[w_k w_l^T] = Q_k'\delta_{kl}$$ and $$x_0 \sim \mathcal{N}(\hat{x}_0, P_0)$$

The goal is to propagate the mean and covariance: $\hat{x}_k = \mathbb{E}[x] = \Phi_{k-1}\hat{x}_{k-1} + \Gamma_{k-1}u_{k-1}$

\begin{aligned}P_k &= \mathbb{E}[(x_k - \hat{x}_k)(x_k-\hat{x}_k)^T] \\&= \Phi_{k-1}P_{k-1}\Phi_{k-1}^T + \Lambda_{k-1}Q_{k-1}'\Lambda_{k-1}^T\end{aligned}

The resulting update $$\left(\hat{x}_{k-1}, P_{k-1}\right) \rightarrow\left(\hat{x}_{k}, P_{k}\right)$$ is called a Gauss-Markov sequence, given by \begin{aligned}\hat{x}_{k} &=\Phi_{k-1} \hat{x}_{k-1}+\Gamma_{k-1} u_{k-1} \\P_{k} &=\Phi_{k-1} P_{k-1} \Phi_{k-1}^{T}+Q_{k-1} \\Q_{k} &= \Lambda_k Q_k' \Lambda_k^T\end{aligned}

### Continuous to Discrete

Consider $\dot{x}(t)=F(t) x(t)+G(t) u(t)+L(t) w(t)$ We have $x(t_k) =\Phi(t_{k}, t_{k-1}) x(t_{k-1})+\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) [G(\tau) u(\tau)+L(\tau) w(\tau)] d\tau$ where $$\Phi(t_k, t_{k-1})$$ is the state transition matrix such that $\Phi(t_{k}, t_{k-1}) x(t_{k-1})=x(t_{k-1})+\int_{t_{k-1}}^{t_{k}}[F(\tau) x(\tau)] d \tau$ The mean evolves according to (assume the control is constant during the sampling interval) \begin{aligned}\hat{x}_{k} &= \Phi_{k-1} \hat{x}_{k-1}+\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) G(\tau) u(\tau) d \tau \\&= \Phi_{k-1} \hat{x}_{k-1}+\left[\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) G(\tau) d \tau\right] u_{k-1} \\&= \Phi_{k-1} \hat{x}_{k-1} + \Gamma_{k-1}u_{k-1}\end{aligned} The covariance evolves according to \begin{aligned}P_{k} &=\Phi_{k-1} P_{k-1} \Phi_{k-1}^{T}+\mathbb{E}\left\{\left[\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) L(\tau) w(\tau) d \tau\right]\left[\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \alpha) L(\alpha) w(\alpha) d \alpha\right]^{T}\right\} \\&=\Phi_{k-1} P_{k-1} \Phi_{k-1}^{T}+\int_{t_{k-1}}^{t_{k}} \int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) L(\tau) \underbrace{\mathbb{E}\left[w(\tau) w(\alpha)^{T}\right]}_{Q_{c}^{\prime} \delta(\tau-\alpha)} L(\alpha)^{T} \Phi(t_{k}, \alpha)^T d \tau d \alpha \\&=\Phi_{k-1} P_{k-1} \Phi_{k-1}^{T}+\int_{t_{k-1}}^{t_{k}} \Phi(t_{k}, \tau) L(\tau) Q_{c}^{\prime} L(\tau)^{T} \Phi(t_{k}, \tau)^T d \tau \\& \triangleq \Phi_{k-1} P_{k-1} \Phi_{k-1}^{T}+Q_{k-1}\end{aligned} where $$Q_{k-1}$$ can be approximated by $Q_{k-1} \approx L(t_{k-1}) Q_{c}^{\prime}(t_{k-1}) L(t_{k-1})^T \Delta t$

## Linear-Optimal Estimation

Combine measurement updates and uncertainty propagation to optimally estimate the state. Consider a discrete LTV model $\begin{array}{l}x_{k}=\Phi_{k-1} x_{k-1}+\Gamma_{k-1} u_{k-1}+\Lambda_{k-1} w_{k-1} \\z_{k}=H_{k} x_{k}+v_{k}\end{array}$ Estimating the state $$x_k$$ is performed by iterating between uncertainty propagation and measurement updates. The mean and covariance after an uncertainty propagation from step $$k-1$$ to step $$k$$ are denoted by $$(x_{k|k-1}, P_{k|k-1})$$; the mean and covariance after a measurement update at step $$k$$ are denoted by $$(x_{k|k}, P_{k|k})$$. The Kalman filter:

Prediction: $\begin{array}{l}\hat{x}_{k | k-1}=\Phi_{k-1} \hat{x}_{k-1 | k-1}+\Gamma_{k-1} u_{k-1} \\P_{k | k-1}=\Phi_{k-1} P_{k-1 | k-1} \Phi_{k-1}^{T}+Q_{k-1}\end{array}$ Correction: $\begin{array}{l}\hat{x}_{k | k}=\hat{x}_{k | k-1}+K_{k}\left(z_{k}-H_{k} \hat{x}_{k | k-1}\right) \\P_{k | k}=P_{k | k-1}-P_{k | k-1} H_{k}^{T}\left(H_{k} P_{k | k-1} H_{k}^{T}+R_{k}\right)^{-1} H_{k} P_{k | k-1} \\K_{k}=P_{k | k} H_{k}^{T} R_{k}^{-1}\end{array}$
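A one-step predict/correct sketch in NumPy (the constant-velocity model and all matrices are invented for illustration):

```python
import numpy as np

def kf_step(x, P, u, z, Phi, Gamma, Q, H, R):
    """One Kalman filter iteration: prediction followed by correction."""
    # Prediction
    x_pred = Phi @ x + Gamma @ u
    P_pred = Phi @ P @ Phi.T + Q
    # Correction; the gain is written with the innovation covariance S,
    # which is algebraically equivalent to K = P_{k|k} H^T R^{-1}
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new

# Invented constant-velocity model: state = (position, velocity)
Phi = np.array([[1.0, 1.0], [0.0, 1.0]])
Gamma = np.zeros((2, 1))
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])      # only position is measured
R = np.array([[1.0]])
x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = kf_step(x, P, np.zeros(1), np.array([1.2]), Phi, Gamma, Q, H, R)
print(x, P)
```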

# Stochastic Control

$\begin{array}{ll}\text{minimize} &J(u(\cdot))=\mathbb{E}\left[\phi\left(x(t_{f}), t_{f}\right)+\int_{t_{0}}^{t_{f}} \mathcal{L}(x(\tau), u(\tau), \tau) d \tau\right] \\\text{subject to} & \dot{x}(t) = f(x(t), u(t), w(t), t)\end{array}$

where $$\mathbb{E}[w(t)]=0$$ and $$\mathbb{E}[w(t)w(\tau)^T] = W(t) \delta(t-\tau)$$ for $$W(t) \in \mathbb{S}_+$$

## Perfect Measurements

Consider $\dot{x}(t)=f(x(t), u(t))+L(t) w(t)$ The value function is defined as \begin{aligned}V(x(t), t)&=\min _{u(t), t \in[t, t_{f}]} \mathbb{E}\left[\phi\left(x(t_{f}), t_{f}\right)+\int_{t}^{t_{f}} \mathcal{L}(x(\tau), u(\tau), \tau) d \tau\right] \\&= \min_{u(t), t \in[t, t+\Delta t]} \mathbb{E}\left[\int_{t}^{t+\Delta t} \mathcal{L}(x(\tau), u(\tau), \tau) d \tau + V(x(t+\Delta t), t+\Delta t)\right]\end{aligned} Expand $$V(x,t)$$ to first order in $$\Delta t$$ according to \begin{aligned}V(x+\Delta x, t+\Delta t) &=V(x, t)+\partial_{t} V(x, t) \Delta t+\nabla_{x} V(x, t)^{T} \Delta x+\frac{1}{2} \Delta x^{T} \nabla_{x}^{2} V(x, t) \Delta x+o(\Delta t) \\&=V+\partial_{t} V \Delta t+\nabla_{x} V^{T}(f+L w) \Delta t+\frac{1}{2}(f+L w)^{T} \nabla_{x}^{2} V(f+L w) \Delta t^{2}+o(\Delta t)\end{aligned} Note that in the above it is necessary to expand $$V$$ to second order in $$\Delta x$$ since $$w$$ is of order $$1/\sqrt{\Delta t}$$

Substituting back to get the stochastic HJB equation $-\partial_t V(x, t) =\min_{u(t)}\left\{\mathcal{L}(x, u, t)+\nabla_{x} V(x, t)^{T} f(x, u)+\frac{1}{2} \operatorname{tr}\left[\nabla_{x}^{2} V(x, t) L(t) W(t) L(t)^{T}\right]\right\}$

### Continuous Linear-Quadratic Systems

$\dot{x} = Fx + Gu + Lw \\J = \frac{1}{2} \mathbb{E}\left\{x^T(t_f) S_f x(t_f) + \int_{t_0}^{t_f} \begin{bmatrix} x(t) \\ u(t) \end{bmatrix}^T\begin{bmatrix}Q(t) & M(t) \\M(t)^T & R(t)\end{bmatrix}\begin{bmatrix} x(t) \\ u(t) \end{bmatrix} dt\right\}$

with $$x(0)=x_0$$ and given $$t_0$$ and $$t_f$$.

To solve the stochastic HJB, consider the value function of the form $V(t)=\frac{1}{2} x^{T}(t) S(t) x(t)+v(t)$ where $$v(t)$$ is the stochastic value function increment defined by $v(t)=\frac{1}{2} \int_{t}^{t_{f}} \operatorname{tr}\left[S(\tau) L W L^{T}\right] d \tau$ Plug into the stochastic HJB equation and solve for $$u$$, and substitute back to get the following equations: $-\dot{S}=\left(F-G R^{-1} M^{T}\right)^{T} S+S\left(F-G R^{-1} M^{T}\right)+Q-S G R^{-1} G^{T} S-M R^{-1} M^{T} \\-\dot{v} = \frac{1}{2}\operatorname{tr}(SLWL^T)$

### Discrete Linear-Quadratic Systems

$x_{k+1} = F_k x_k + G_k u_k + L_k w_k \\J = \frac{1}{2} \mathbb{E}\left\{x_N^T S_N x_N + \sum_{k=0}^{N-1} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T\begin{bmatrix}Q & M \\M^T & R\end{bmatrix}\begin{bmatrix} x_k \\ u_k \end{bmatrix}\right\}$

with $$x_0$$ given and a fixed horizon $$N$$.

Similarly use a value function of the form $V_k = \frac{1}{2}x_k^T S_k x_k + v_k \\v_k = \frac{1}{2} \operatorname{tr}[S_{k+1} L W_k L^T] + v_{k+1}$ Plugging into the stochastic HJB equation gives $S_{k}=\left(F^{T} S_{k+1} F+Q\right)-\left(M^{T}+G^{T} S_{k+1} F\right)^{T}\left(R+G^{T} S_{k+1} G\right)^{-1}\left(M^{T}+G^{T} S_{k+1} F\right)$
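The backward sweep is straightforward to implement; a minimal sketch (matrices invented for illustration) in which the scalar case $$F=G=Q=R=1$$, $$M=0$$ converges to the positive root of $$S^2 = S + 1$$:

```python
import numpy as np

def riccati_backward(F, G, Q, R, M, S_N, N):
    """Backward sweep from S_N for the discrete LQ Riccati recursion."""
    S = S_N
    for _ in range(N):
        W = M.T + G.T @ S @ F          # M^T + G^T S_{k+1} F
        # S_k = F^T S_{k+1} F + Q - W^T (R + G^T S_{k+1} G)^{-1} W
        S = F.T @ S @ F + Q - W.T @ np.linalg.solve(R + G.T @ S @ G, W)
    return S

# Scalar example: the iteration converges to the golden ratio (1 + sqrt(5)) / 2
one = np.eye(1)
S0 = riccati_backward(one, one, one, one, np.zeros((1, 1)), one, 50)
print(S0)  # ≈ 1.618
```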

]]>
<p><a href="https://asco.lcsr.jhu.edu/en530-603-f2020-applied-optimal-control/">EN530.603</a> Review.</p>

# On-Scene First Aid Overview

## Concept and Significance

• Where: at the scene of the incident or en route to the hospital
• Who: the victim, the first bystander, or medical personnel
• How: the most basic rescue techniques (CPR, external defibrillation, hemostasis and bandaging, moving and immobilization, etc.)

1. Sustain life
2. Prevent the injury or illness from worsening
3. Promote recovery

## First-Aid Positioning

1. Supine without a pillow: resuscitation position.
2. Head-elevated supine: with a pillow. Commonly used for conscious victims of stroke, heat stroke, or head trauma.
3. Head-elevated side-lying: similar to head-elevated supine, but for victims with nausea, vomiting, or copious oral secretions.
4. Head-low side-lying: recovery position. For unconscious victims with normal pulse and breathing and no spinal injury.
5. Shock position: head and chest raised about 20°–30°, legs raised about 15°–20°.
6. Semi-recumbent: for conscious victims with chest or abdominal trauma or breathing difficulty.

Placing a victim in the recovery position:

1. Empty the victim's pockets
2. Kneel at the victim's right side
3. Bend the victim's right arm out toward you
4. Bend the left arm and place it on the chest
5. Lift the victim's left knee with one hand and hold the left shoulder with the other, then roll the victim toward you onto the side
6. Place the back of the victim's left hand under the cheek

General care:

1. Loosen clothing and position the victim properly
2. Keep the airway clear and minimize movement
3. Relieve anxiety and regulate body temperature
4. Conscious victims may be given warm drinks or light salt water; give nothing by mouth to victims with impaired consciousness

# Cardiopulmonary Resuscitation

## Overview

The golden four minutes of CPR:

1. 10 s — sudden collapse with loss of consciousness
2. 60 s — spontaneous breathing gradually stops
3. 4 min — cerebral edema begins
4. 6 min — brain death begins
5. 8 min — brain death

## Procedure

• Simplified: hands-only CPR. (Exceptions: asphyxial arrest such as drowning or gas poisoning, and patients under 8 years old)
• Reversed order: compressions ("hands") first, rescue breaths ("mouth") second

• First, check: determine whether the victim is conscious and breathing
1. Tap both shoulders
2. Call out loudly
3. Watch for chest rise and fall
• Second, call: call for help. (For drowning, poisoning, trauma, respiratory arrest, or patients under 8 years old, give first aid before calling)
• C — chest compressions:
• Victim supine
• Compression point: the midpoint of the line between the nipples, i.e. the junction of the middle and lower thirds of the sternum
• Technique: compress vertically with the heel of the hand
• Depress the sternum 5–6 cm at a rate of 100–120 compressions per minute
• A — open the airway:
1. Clear foreign objects from the mouth: turn the victim's head to one side, press down the tongue and jaw with your thumb, and sweep the index finger of the other hand around the side of the throat to hook the object out.
2. Head-tilt/chin-lift: prevents the base of the tongue from blocking the airway. Press the victim's forehead back with the edge of one hand and lift the chin with the index and middle fingers of the other, until the line from the jaw to the earlobe is perpendicular to the ground.
• B — rescue breathing: pinch the nose, seal the mouth, and blow.

## Adults, Children, and Infants

For children:

• Use one hand for chest compressions; the other hand can press the forehead to keep the airway open.
• When opening the airway, the jaw-earlobe line is at 60° to the ground rather than vertical

For infants:

• Check responsiveness by tapping the soles of the feet
• Use fingers for chest compressions
• Open the airway to 30°
• Rescue breathing covers both the mouth and nose
• Check the pulse at the brachial artery on the inner upper arm

# Automated External Defibrillator

1. Open: open the case and switch on the power
2. Attach: attach the electrode pads, one just below the right collarbone and the other on the lower outer side of the left chest
3. Plug: plug in the connector; the AED then analyzes the heart rhythm automatically (5–15 s)
4. Press: if the AED detects a shockable rhythm, it prompts you to press the shock button

==Precautions==:

1. Do not touch the patient while the rhythm is being analyzed or a shock is being delivered
2. Because an AED delivers a fixed power, it must not be used on infants under 1 year old
3. Use the AED as early as possible, alongside CPR

# Heimlich Maneuver

## Recognizing and Treating Airway Obstruction

• Partial obstruction: violent coughing, gasping, still able to breathe, responsive

Treatment: encourage coughing to help expel the object

• Complete obstruction: stridor, hoarseness, inspiratory difficulty, hands clutched at the throat in a V shape, weak cough

Treatment: the Heimlich maneuver

## Principle and Steps

### Adults and Children over 1 Year Old

Abdominal thrusts:

1. Technique: thrust with the thumb side of the fist
2. Location: the mid-abdomen, between the xiphoid process and the navel
3. Direction: upward, backward, and quick

Chest thrusts:

1. Technique: thrust with the heel of the hand
2. Location: the middle-to-lower sternum

### Infants

Back blows:

1. Location: between the shoulder blades
2. Technique: strike downward

Chest thrusts:

1. Location: the lower half of the sternum, just below the nipple line
2. Technique: press with two fingers

# Hemostasis

## Types of Bleeding

• Capillary bleeding: oozing, bright red
• Venous bleeding: flows out slowly, dark red, does not stop on its own
• Arterial bleeding: spurting, bright red, can usually be stopped with prompt first aid

## Hemostasis Methods

• Press: direct pressure on the bleeding point
• Bandage: pressure dressing
• Pack: wound packing
• Tie: tourniquet

### Digital Pressure Points

• Bleeding from the top of the head: compress the superficial temporal artery with the thumb
• Bleeding from the eye or face: compress the facial artery with the thumb where the front edge of the masseter meets the mandible on the injured side
• Nosebleed: have the victim sit with the head tilted forward, breathing through the mouth. Pinch the point where the nasolabial fold meets the nostril with the thumb and index finger for 10 minutes; if bleeding has not stopped after releasing, pinch for another 10 minutes
• Forearm bleeding: raise the limb and press the pulsating point in the middle of the medial groove of the biceps with the thumb, pressing the brachial artery outward against the humerus
• Palm bleeding: raise the hand and compress the ulnar and radial arteries at the wrist with the two thumbs
• Finger bleeding: raise the hand and compress the digital arteries on both sides of the finger base with the thumb and index finger
• Thigh bleeding: just below the midpoint of the groin crease, press the femoral artery backward firmly with both thumbs
• Foot bleeding: compress the dorsalis pedis artery and the posterior tibial artery (between the medial malleolus and the Achilles tendon) with the two thumbs
• Skin abrasion: a common minor injury with little bleeding. Rinse the wound under running tap water until it is free of debris, disinfect the skin around the bleeding point, and cover the wound with a dressing made from a clean towel or other soft cloth

Tourniquet rules:

• Fast — act quickly, save time
• Accurate — locate the bleeding point and prepare the tourniquet
• Pad — do not apply directly against the skin
• Above — tie above (proximal to) the wound
• Fit — tension must be just right
• Mark — attach a red tag
• Release — loosen the tourniquet at regular intervals

# Traumatic Fractures

## Overview

By whether the skin is broken:

• Closed fracture
• Open fracture

By degree:

• Complete fracture
• Incomplete fracture
• Impacted fracture

Signs and symptoms:

• Local:
• General signs: pain, swelling, loss of function, and tenderness
• Specific signs: deformity, abnormal mobility, and bone crepitus
• Systemic:
• Shock, mainly caused by blood loss
• Fever, usually low-grade (< 38 °C); high fever usually indicates infection

## First Aid for Fractures

• Deal with asphyxia, bleeding, and other severe injuries first
• Treat on the spot if the surroundings are safe
• Use materials at hand to dress wounds and immobilize the fracture
• Do not push protruding bone back into the wound
• Keep the victim warm and the airway clear to prevent shock
• Send to hospital immediately, monitoring skin color, temperature, and pulse

## Fracture Immobilization

• Splint materials: improvised wooden, metal, or plastic splints
• Improvised splints: boards, sticks, branches, bamboo poles, etc.
• Without a splint, the injured limb can be secured to the victim's trunk or to the uninjured limb

• The splint should extend beyond the joints at both ends
• Pad soft material between the splint and the limb
• Tie flat (square) knots on the uninjured side or over the splint

# Emergency Response to Everyday Accidents

## Heat Stroke

• Threatened heat stroke: headache, blurred vision, tinnitus, dizziness, thirst, palpitations; body temperature normal or slightly elevated
• Mild heat stroke: temperature above 38 °C, flushed or pale face, profuse sweating, cold clammy skin, falling blood pressure, rapid pulse
• Severe heat stroke: nausea, vomiting, dilated pupils, abdominal or limb cramps, rapid pulse, often with high fever or even loss of consciousness

1. Move the patient: get out of the hot environment and quickly move the patient to a ventilated place; lay flat and loosen clothing to aid breathing and heat loss
2. Physical cooling: sponge with cold water or diluted alcohol, or place cold wet towels, ice packs, or ice on the large arteries at the neck, armpits, and groin to help dissipate heat
3. Give medication
4. Acupressure: if the patient is unconscious, press the Renzhong, Hegu, and Neiguan acupoints with the thumb
5. Replace fluids: warm water, light salt water, or fresh fruit juice
6. Call emergency services: transfer the patient to hospital immediately, preferably in an air-conditioned vehicle

## Carbon Monoxide Poisoning

1. Open the doors and windows and quickly move the patient to fresh air
2. Loosen clothing, keep the airway clear, and quickly assess consciousness, breathing, and heartbeat
3. Promptly clear airway secretions and vomit; perform chest compressions and rescue breathing if necessary
4. Call emergency services
5. Keep the patient resting quietly; activity increases the load on the heart and lungs and raises oxygen consumption

## Dog Bites

• Always treat according to rabies-prevention protocol
• Wash and clean the wound thoroughly on the spot, immediately

1. Rinse the wound: flush with running water (tap water) under some pressure, alternating with soapy water or another mildly alkaline cleanser, for at least 15 minutes. Then rinse the wound with normal saline and blot any remaining liquid with sterile absorbent cotton. Avoid leaving soap residue in the wound
2. Disinfect the wound: after thorough rinsing, apply 2–3% iodophor or 75% alcohol to the wound. If there is necrotic tissue, remove it completely before disinfecting
• Eyes: rinse with sterile normal saline; as a rule do not use any disinfectant
• Mouth: keep the head low while rinsing, so that the fluid does not run into the throat and cause choking
• Genital or anal mucosa: treat and rinse as for skin, but direct the flow outward to avoid contaminating deeper mucosa
3. Do not bandage the wound; get rabies vaccination promptly

# Road Traffic Accident Injuries

]]>
<p><a href="https://www.icourse163.org/course/NCU-1001555029">现场生命急救知识与技能</a> (On-Scene Emergency Life Support Knowledge and Skills) course notes.</p>
Algorithm II https://silencial.github.io/algorithms-2/ 2020-07-07T00:00:00.000Z 2020-07-13T00:00:00.000Z Review of Princeton Algorithms II course on Coursera.

Algorithms I Review

My solution to the homework

Course overview:

| topic | data structures and algorithms |
| --- | --- |
| data types | stack, queue, bag, union-find, priority queue |
| sorting | quicksort, mergesort, heapsort |
| searching | BST, red-black BST, hash table |
| graphs | BFS, DFS, Prim, Kruskal, Dijkstra |
| strings | radix sorts, tries, KMP, regexps, data compression |
| advanced | B-tree, suffix array, maxflow |

# Undirected Graphs

Terminology:

• Path: sequence of vertices connected by edges
• Cycle: path whose first and last vertices are the same

Some graph problems:

• Path: Is there a path between $$s$$ and $$t$$
• Shortest path: What is the shortest path between $$s$$ and $$t$$
• Cycle: Is there a cycle in the graph
• Euler tour: Is there a cycle that uses each edge exactly once
• Hamilton tour: Is there a cycle that uses each vertex exactly once.
• Connectivity: Is there a way to connect all of the vertices?
• MST: What is the best way to connect all of the vertices?
• Biconnectivity: Is there a vertex whose removal disconnects the graph?
• Planarity: Can you draw the graph in the plane with no crossing edges
• Graph isomorphism: Do two adjacency lists represent the same graph?

## API

For vertices, convert them between names and integers with symbol table

For edges, we have 3 choices:

1. Maintain a list of the edges
2. Maintain a 2-D $$V$$-by-$$V$$ boolean matrix
3. Maintain a vertex-indexed array of lists

Comparison:

| Representation | Space | Add edge | Edge between v and w | Iterate over vertices adjacent to v |
| --- | --- | --- | --- | --- |
| list of edges | E | 1 | E | E |
| adjacency matrix | V² | 1 | 1 | V |
| adjacency lists | E + V | 1 | degree(v) | degree(v) |

In practice, use adjacency lists because

• Algorithms are based on iterating over vertices adjacent to $$v$$
• Real-world graphs tend to be sparse

Depth-first search (DFS), starting from $$v$$:

1. Mark vertex $$v$$ as visited
2. Recursively visit all unmarked vertices adjacent to $$v$$

After DFS, can find vertices connected to $$s$$ in constant time (check marked[]) and can find a path to $$s$$ in time proportional to its length (edgeTo[] is a parent-link representation of a tree rooted at $$s$$).

Breadth-first search (BFS), repeating until the queue is empty:

1. Remove vertex $$v$$ from queue
2. Add to queue all unmarked vertices adjacent to $$v$$ and mark them

BFS examines vertices in increasing distance from $$s$$. It can be used to find the shortest path.

## Connected Components

Definition:

• Vertices $$v$$ and $$w$$ are connected if there is a path between them
• A connected component is a maximal set of connected vertices

Use DFS to partition vertices into connected components, then can answer whether $$v$$ is connected to $$w$$ in constant time
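The preprocessing can be sketched in Python with an iterative DFS (graph stored as adjacency lists in a dict; the example graph is invented for illustration):

```python
def connected_components(graph):
    """Label each vertex with a component id via DFS; graph is {v: [neighbors]}."""
    comp, cid = {}, 0
    for s in graph:
        if s in comp:
            continue
        stack = [s]                 # iterative DFS from a new source
        comp[s] = cid
        while stack:
            v = stack.pop()
            for w in graph[v]:
                if w not in comp:   # unmarked: same component as v
                    comp[w] = cid
                    stack.append(w)
        cid += 1
    return comp

g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
comp = connected_components(g)
print(comp[0] == comp[2], comp[0] == comp[3])  # True False
```

After this linear-time pass, connectivity queries are constant-time dictionary lookups.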

# Directed Graphs

## API

Almost same as Graph:

Both DFS and BFS are digraph algorithms, code is identical to undirected graphs.

DFS search applications:

• Every program is a digraph. Vertex=basic block of instructions; edge=jump
• Dead-code elimination: find and remove unreachable code
• Infinite-loop detection: whether exit is unreachable
• Every data structure is a digraph. Vertex=object; edge=reference
• Mark-sweep algorithm: collect unreachable objects as garbage

BFS search applications:

• Multiple-source shortest paths: initialize by enqueuing all source vertices
• Web crawler

## Topological Sort

Precedence scheduling: Given a set of tasks to be completed with precedence constraints, in which order should we schedule the tasks?

1. Represent data as digraph: vertex=task; edge=precedence constraint
2. Represent the problem as topological sort: redraw directed acyclic graph (DAG) so all edges point upwards

Run DFS and return vertices in reverse postorder.
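The reverse-postorder idea can be sketched on a small invented DAG:

```python
def topological_order(digraph):
    """Return vertices of a DAG in topological order (reverse DFS postorder)."""
    marked, postorder = set(), []

    def dfs(v):
        marked.add(v)
        for w in digraph[v]:
            if w not in marked:
                dfs(w)
        postorder.append(v)       # v is finished after all its descendants

    for v in digraph:
        if v not in marked:
            dfs(v)
    return postorder[::-1]        # reverse postorder

dag = {0: [1, 2], 1: [3], 2: [3], 3: []}
order = topological_order(dag)
print(order)                      # every edge points forward in this order
```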

## Strong Components

Definition:

• Vertices $$v$$ and $$w$$ are strongly connected if there is both a directed path from $$v$$ to $$w$$ and a directed path from $$w$$ to $$v$$
• A Strong component is a maximal subset of strongly-connected vertices

Applications:

• Food Web. Vertex=species; edge=from producer to consumer; strong component=subset of species with common energy flow
• Software module dependency graph. Vertex=software module; edge=from module to dependency; strong component=subset of mutually interacting modules

Kosaraju-Sharir algorithm:

1. Compute topological order (reverse postorder) in Reverse Graph $$G^R$$
2. Run DFS in $$G$$, visiting unmarked vertices in the order computed above
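The two passes can be sketched in Python (small digraph invented for illustration):

```python
def strong_components(digraph):
    """Kosaraju-Sharir: DFS order from G^R, then DFS in G in that order."""
    # Build the reverse graph G^R
    rev = {v: [] for v in digraph}
    for v, outs in digraph.items():
        for w in outs:
            rev[w].append(v)

    # Pass 1: reverse postorder of G^R
    marked, post = set(), []
    def dfs_rev(v):
        marked.add(v)
        for w in rev[v]:
            if w not in marked:
                dfs_rev(w)
        post.append(v)
    for v in rev:
        if v not in marked:
            dfs_rev(v)

    # Pass 2: DFS in G, visiting vertices in reverse postorder of G^R;
    # each DFS tree is one strong component
    comp, cid = {}, 0
    def dfs(v):
        comp[v] = cid
        for w in digraph[v]:
            if w not in comp:
                dfs(w)
    for v in reversed(post):
        if v not in comp:
            dfs(v)
            cid += 1
    return comp

g = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}
comp = strong_components(g)
print(comp)  # {0,1,2} and {3,4} get distinct component ids
```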

## HW6 Wordnet

Specification

• WordNet digraph: build digraph from input files. WordNet.java
• synsets.txt saves all the id and corresponding words ==(one id can have one or more words, and one word can have one or more ids)==
• hypernyms.txt contains hypernym relationships
• Shortest ancestral path: given two vertices, find a common ancestor that minimizes the total length of the two directed paths to it. SAP.java
• Outcast detection: given a list of WordNet nouns, find the noun least related to the others. This can be measured by the sum of SAP distances to all the other vertices. Outcast.java

# Minimum Spanning Trees

Definition: A spanning tree of $$G$$ is a subgraph $$T$$ that is both a tree (connected and acyclic) and spanning (includes all of the vertices)

Problem: Given undirected graph $$G$$ with positive edge weights, find the min weight spanning tree.

## Edge-Weighted Graph API

Edge abstraction needed for weighted edges.

Edge-weighted graph representation:

## Greedy Algorithm

Simplifications:

• Edge weights are distinct
• Graph is connected

Then MST exists and is unique.

Definitions:

• A cut in a graph is a partition of its vertices into two (nonempty) sets
• A crossing edge connects a vertex in one set with a vertex in the other.

Given any cut, the crossing edge of min weight is in the MST.

Algorithm:

1. Start with all edges colored white
2. Find cut with no black crossing edges; color its min-weight edge black
3. Repeat until $$V-1$$ edges are colored black

Remove simplifying assumptions:

• Greedy MST still correct if equal weights are present
• Compute MST of each component if graph is not connected

Efficient implementations: How to choose cut? How to find min-weight edge?

## Kruskal's Algorithm

1. Consider edges in ascending order of weight
2. Add next edge to tree $$T$$ unless doing so would create a cycle

The challenge is how to check whether adding edge $$v-w$$ to tree $$T$$ creates a cycle. Use union-find:

• Maintain a set for each connected component in $$T$$
• If $$v$$ and $$w$$ are in same set, then adding $$v-w$$ would create a cycle
• To add $$v-w$$ to $$T$$, merge sets containing $$v$$ and $$w$$
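A sketch of Kruskal with a path-compressing union-find (the edge list is invented for illustration):

```python
def kruskal_mst(n, edges):
    """edges: list of (weight, v, w) on vertices 0..n-1. Returns MST edges."""
    parent = list(range(n))

    def find(v):                          # find root, compressing the path
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for weight, v, w in sorted(edges):    # ascending order of weight
        rv, rw = find(v), find(w)
        if rv != rw:                      # different sets: no cycle created
            parent[rv] = rw               # merge the two components
            mst.append((weight, v, w))
        if len(mst) == n - 1:
            break
    return mst

edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (4.0, 2, 3)]
mst = kruskal_mst(4, edges)
print(sum(w for w, _, _ in mst))  # 7.0
```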

## Prim's Algorithm

1. Start with vertex $$0$$ and greedily grow tree $$T$$
2. Add to $$T$$ the min weight edge with exactly one endpoint in $$T$$
3. Repeat until $$V-1$$ edges

### Lazy Implementation

Maintain a PQ of edges with (at least) one endpoint in $$T$$, where the priority is the weight of edge

1. Delete-min to determine next edge $$e=v-w$$ to add to $$T$$
2. Disregard if both endpoints $$v$$ and $$w$$ are marked (both in $$T$$)
3. Otherwise, let $$w$$ be the unmarked vertex (not in $$T$$)
• add to PQ any edge incident to $$w$$
• add $$e$$ to $$T$$ and mark $$w$$

### Eager Implementation

Maintain a PQ of vertices connected by an edge to $$T$$, where priority of vertex $$v$$ = the weight of shortest edge connecting to $$T$$

1. Delete min vertex $$v$$ and add its associated edge $$v-w$$ to $$T$$
2. Update PQ by considering all edges $$v-x$$ incident to $$v$$
• ignore if $$x$$ is already in $$T$$
• add $$x$$ to PQ if not already in it
• update priority of $$x$$ if $$v-x$$ becomes shortest edge connecting $$x$$ to $$T$$

This requires an indexed min-priority queue (IndexMinPQ) data structure.
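The lazy implementation described above can be sketched with Python's heapq as the edge priority queue (the graph is invented for illustration):

```python
import heapq

def prim_mst(graph, s=0):
    """Lazy Prim. graph: {v: [(weight, w), ...]}, undirected. Returns MST edges."""
    marked, mst, pq = {s}, [], []
    for weight, w in graph[s]:
        heapq.heappush(pq, (weight, s, w))
    while pq and len(mst) < len(graph) - 1:
        weight, v, w = heapq.heappop(pq)   # min-weight edge on the PQ
        if w in marked:                    # lazy: discard obsolete edges
            continue
        marked.add(w)                      # w joins the tree
        mst.append((weight, v, w))
        for weight2, x in graph[w]:
            if x not in marked:            # add edges incident to w
                heapq.heappush(pq, (weight2, w, x))
    return mst

g = {0: [(1.0, 1), (3.0, 2)], 1: [(1.0, 0), (2.0, 2)],
     2: [(3.0, 0), (2.0, 1), (4.0, 3)], 3: [(4.0, 2)]}
mst = prim_mst(g)
print(sum(w for w, _, _ in mst))  # 7.0
```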

# Shortest Paths

Given an edge-weighted digraph, find the shortest path from $$s$$ to $$t$$.

| Algorithm | Restriction | Typical case | Worst case | Extra space |
| --- | --- | --- | --- | --- |
| Topological sort | no directed cycles | E + V | E + V | V |
| Dijkstra (binary heap) | no negative weights | E log V | E log V | V |
| Bellman-Ford | no negative cycles | E V | E V | V |
| Bellman-Ford (queue-based) | no negative cycles | E + V | E V | V |

## API

Weighted directed edge API:

The Edge-weighted digraph API is the same as EdgeWeightedGraph.

Single-source shortest paths API:

## Edge Relaxation

Relax edge $$e=v \rightarrow w$$

• distTo[v] is length of shortest known path from $$s$$ to $$v$$
• distTo[w] is length of shortest known path from $$s$$ to $$w$$
• edgeTo[w] is last edge on shortest known path from $$s$$ to $$w$$
• If $$e = v \rightarrow w$$ gives shorter path to $$w$$ through $$v$$, update both distTo[w] and edgeTo[w]

Different ways to choose which edge to relax:

• Dijkstra's algorithm (nonnegative weights)
• Topological sort algorithm (no directed cycles)
• Bellman-Ford algorithm (no negative cycles)

## Dijkstra's Algorithm

1. Consider vertices in increasing order of distance from $$s$$
2. Add vertex to tree and relax all edges pointing from that vertex

Prim's algorithm is essentially the same algorithm; they differ in how the next vertex is chosen for the tree:

• Prim's algorithm chooses the closest vertex to the tree (via an undirected edge)
• Dijkstra's algorithm chooses the closest vertex to the source (via a directed path)
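A sketch of Dijkstra with a binary heap (heapq), using lazy deletion in place of an IndexMinPQ (the digraph is invented for illustration):

```python
import heapq

def dijkstra(digraph, s):
    """digraph: {v: [(w, weight), ...]} with nonnegative weights. Returns distTo."""
    dist = {s: 0.0}
    pq = [(0.0, s)]                      # vertices keyed by distance from s
    done = set()
    while pq:
        d, v = heapq.heappop(pq)
        if v in done:                    # stale entry: already finalized
            continue
        done.add(v)                      # v's shortest distance is now final
        for w, weight in digraph[v]:     # relax all edges pointing from v
            nd = d + weight
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

g = {0: [(1, 5.0), (2, 1.0)], 1: [(3, 1.0)], 2: [(1, 2.0)], 3: []}
dist = dijkstra(g, 0)
print(dist)  # shortest distance from 0 to every reachable vertex
```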

## Topological Sort

Suppose the edge-weighted digraph has no directed cycles

1. Consider vertices in topological order
2. Relax all edges pointing from that vertex

This is the most efficient way to compute the SPT in an edge-weighted DAG; time is proportional to $$E + V$$.

Applications:

• Seam carving: Resize an image without distortion
• Grid DAG: vertex = pixel; edge = from pixel to 3 downward neighbors
• Weight of pixel = energy function of 8 neighboring pixels
• Seam = shortest path (sum of vertex weights) from top to bottom
• Find longest paths in edge-weighted DAGs
• Topological sort algorithm works with negative weights.
• Negate all the weights, find shortest path, negate weights in result.
• Parallel job scheduling: Given a set of jobs with durations and precedence constraints, schedule the jobs (by finding a start time for each) so as to achieve the minimum completion time.
• Source and sink vertices. Two vertices for each job (begin and end).
• Three edges for each job. Begin to end (weighted by duration); source to begin (0 weight); end to sink (0 weight)
• One edge for each precedence constraint (0 weight)
• Use longest path from the source to schedule each job

## Bellman-Ford

Dijkstra doesn't work with negative edge weights.

An SPT exists iff there are no negative cycles. A negative cycle is a directed cycle whose sum of edge weights is negative.

Bellman-Ford algorithm: repeat $$V$$ times, relaxing each of the $$E$$ edges in every pass. This dynamic programming algorithm computes the SPT in any edge-weighted digraph with no negative cycles in time proportional to $$E \times V$$. Two improvements:

• If distTo[v] does not change during pass $$i$$, no need to relax any edge pointing from $$v$$ in pass $$i+1$$
• Maintain a queue of vertices whose distTo[] changed

Negative cycle can be found by Bellman-Ford algorithm too: If any vertex $$v$$ is updated in phase $$V$$, there exists a negative cycle.
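A sketch of Bellman-Ford with the early-exit optimization and an extra pass for negative-cycle detection (the edge list is invented for illustration):

```python
def bellman_ford(n, edges, s):
    """edges: list of (v, w, weight). Returns (dist, has_negative_cycle)."""
    inf = float('inf')
    dist = [inf] * n
    dist[s] = 0.0
    for _ in range(n - 1):               # n-1 relaxation passes suffice
        changed = False
        for v, w, weight in edges:
            if dist[v] + weight < dist[w]:
                dist[w] = dist[v] + weight
                changed = True
        if not changed:                  # nothing improved: early exit
            break
    # Pass n: any further improvement implies a negative cycle
    neg = any(dist[v] + weight < dist[w] for v, w, weight in edges)
    return dist, neg

edges = [(0, 1, 4.0), (0, 2, 5.0), (1, 2, -2.0)]
dist, neg = bellman_ford(3, edges, 0)
print(dist, neg)  # [0.0, 4.0, 2.0] False
```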

Negative cycle application: Arbitrage detection. Given table of exchange rates, is there an arbitrage opportunity?

• Vertex = currency; edge = transaction; weight = exchange rate
• Find a directed cycle whose product of edge weights > 1
• Take $$-\ln$$ for the weights so that multiplication > 1 turns to addition < 0
• Equivalent to find a negative cycle

## HW7 Seam Carving

Specification

Seam Carving is a content-aware image resizing technique. Remove one row/column every time.

1. Design energy function for pixels
2. Find seam with minimal energy with one pixel every row/column, and every two adjacent seam points differ at most 1 column/row
3. Delete seam

For this problem, dynamic programming is equivalent to treat the image as a graph and find the shortest path on it.

1. Treat every pixel as a vertex $$(column, row)$$ with edges connecting it to the vertices $$(column - 1, row + 1)$$, $$(column, row + 1)$$ and $$(column + 1, row + 1)$$, unless a vertex is invalid.
2. This graph is a DAG so the shortest path problem can be solved by topological sort. Moreover, since edges are connected layer by layer, we can sort it layer by layer too.
3. To optimize the algorithm, energy can be precomputed and stored. When removing seams, only points near the removed location should be updated.
4. Find/remove horizontal and vertical seam are equivalent, except for a transpose.

Solution: SeamCarver.java

# Maximum Flow

Definition:

• A $$st$$-cut is a partition of the vertices into two disjoint sets, with $$s$$ in one set $$A$$ and $$t$$ in the other set $$B$$
• Its capacity is the sum of the capacities of the edges from $$A$$ to $$B$$
• A $$st$$-flow is an assignment of values to the edges such that:
• Capacity constraints: $$0 \le$$ edge's flow $$\le$$ edge's capacity
• Local equilibrium: inflow = outflow at every vertex (except $$s$$ and $$t$$)
• The value of a flow is the inflow at $$t$$

Mincut problem: Find a cut of minimum capacity

Maxflow problem: Find a flow of maximum value

## Ford-Fulkerson Algorithm

For digraph with $$V$$ vertices, $$E$$ edges and integer capacities between $$1$$ and $$U$$:

1. Start with $$0$$ flow
2. Find an undirected path from $$s$$ to $$t$$ (an augmenting path) such that
• Can increase flow on forward edges (not full)
• Can decrease flow on backward edge (not empty)
3. Increase flow on that path by bottleneck capacity

FF performance depends on choice of augmenting paths:

| Augmenting path | Number of paths | Implementation |
| --- | --- | --- |
| shortest path | $$\le 1/2 E V$$ | queue (BFS) |
| fattest path | $$\le E \ln (E U)$$ | priority queue |
| random path | $$\le E U$$ | randomized queue |
| DFS path | $$\le E U$$ | stack (DFS) |
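A sketch using BFS to choose shortest augmenting paths (the first row of the table, i.e. the Edmonds-Karp variant); the flow network is invented for illustration, and residual capacities are kept in a nested dict:

```python
from collections import deque

def max_flow(cap, s, t):
    """cap[v][w] = capacity of edge v->w (nested dict). Returns maxflow value."""
    # Add reverse (residual) edges with zero initial capacity
    for v in list(cap):
        for w in list(cap[v]):
            cap.setdefault(w, {}).setdefault(v, 0.0)

    flow = 0.0
    while True:
        # BFS for a shortest augmenting path in the residual network
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            v = queue.popleft()
            for w, c in cap[v].items():
                if c > 0 and w not in parent:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:               # no augmenting path: flow is maximal
            return flow
        # Bottleneck capacity along the path
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[v][w] for v, w in path)
        # Augment: decrease forward residual, increase backward residual
        for v, w in path:
            cap[v][w] -= bottleneck
            cap[w][v] += bottleneck
        flow += bottleneck

cap = {'s': {'a': 3.0, 'b': 2.0}, 'a': {'b': 1.0, 't': 2.0},
       'b': {'t': 3.0}, 't': {}}
flow = max_flow(cap, 's', 't')
print(flow)  # 5.0
```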

## Maxflow-Mincut Theorem

Definition: The net flow across a cut $$(A, B)$$ is the sum of the flows on its edges from $$A$$ to $$B$$ minus the sum of the flows on its edges from $$B$$ to $$A$$

Flow-value lemma: Let $$f$$ be any flow and let $$(A, B)$$ be any cut. The net flow across $$(A, B)$$ equals the value of $$f$$

Maxflow-mincut theorem: Value of the maxflow = capacity of mincut

To compute mincut $$(A, B)$$ from maxflow $$f$$: Compute $$A$$ = set of vertices connected to $$s$$ by an undirected path with no full forward or empty backward edges.

## Implementation

Residual network:

• Use forward edge to represent residual capacity
• Use backward edge to represent flow

Augmenting path in the original network is equivalent to a directed path in the residual network.

Flow Edge API:

Flow Network API:

Ford-Fulkerson algorithm:

## Applications

### Bipartite Matching Problem

$$N$$ students apply for $$N$$ jobs, each gets several offers. Is there a way to match all students to jobs?

Equivalent to: Given a bipartite graph, find a perfect matching.

1. Create $$s$$, $$t$$, one vertex for each student, and one vertex for each job
2. Add edge from $$s$$ to each student (capacity $$1$$)
3. Add edge from each job to $$t$$ (capacity $$1$$)
4. Add edge from student to each job offered (capacity infinity)

Every perfect matching in bipartite graph corresponds to a maxflow of value $$N$$

### Baseball Elimination Problem

Given each team's current wins and losses, and their games left to play, determine whether team $$i$$ can be eliminated (cannot win).

Construct the graph:

1. Create $$s$$, $$t$$, one vertex for each pair of teams other than $$i$$: $$j \leftrightarrow k$$, and one vertex for each team other than $$i$$
2. Add edge from $$s$$ to each pair of teams (capacity = number of games left between this pair)
3. Add edge from each team $$j$$ to $$t$$ (capacity = $$w_i + r_i - w_j$$, an upper bound on the games that $$j$$ can win)
4. Add edge from each pair of teams to two corresponding teams (capacity infinity)

Team $$i$$ will not be eliminated iff all edges pointing from $$s$$ are full in maxflow

## HW8 Baseball Elimination

Specification

To check whether team $$x$$ is eliminated, we consider two cases:

• Trivial elimination: The maximum number of games team $$x$$ can win is less than the number of wins of some other team $$i$$
• Nontrivial elimination: Solve a maxflow problem as mentioned above

Solution: BaseballElimination.java

# String Sorts

| | worst | average | extra space | stable? |
| --- | --- | --- | --- | --- |
| LSD | $$2NW$$ | $$2NW$$ | $$N + R$$ | yes |
| MSD | $$2NW$$ | $$N \log_R N$$ | $$N + DR$$ | yes |
| 3-way string quicksort | $$1.39 W N \lg R$$ | $$1.39 N \lg R$$ | $$\lg N + W$$ | no |

## String vs. StringBuilder

Operations:

• Length: Number of characters
• Indexing: Get the $$i$$-th character
• Substring extraction: Get a contiguous subsequence of characters
• String concatenation: Append one character to end of another string
| | length() | charAt() | substring() | concat() |
| --- | --- | --- | --- | --- |
| String | 1 | 1 | 1 | N |
| StringBuilder | 1 | 1 | N | 1 |

String:

• sequence of characters (immutable)
• Immutable char[] array, offset, and length

StringBuilder:

• Sequence of characters (mutable)
• Resizing char[] array and length

## Key-Indexed Counting

Compare-based algorithms require $$\sim N\lg N$$ compares.

We can do better if we don't depend on key compares.

Key-indexed counting:

• Assumption: keys are integers between $$0$$ and $$R-1$$
• Implication: can use key as an array index

Sort an array a[] of $$N$$ integers between $$0$$ and $$R-1$$:

• Count frequencies of each letter using key as index
• Compute frequency cumulates which specify destinations
• Access cumulates using key as index to move items
• Copy back into original array
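The four steps can be sketched as:

```python
def key_indexed_counting(a, R):
    """Stable sort of integers in [0, R) by counting."""
    count = [0] * (R + 1)
    for key in a:                 # count frequencies (offset by 1)
        count[key + 1] += 1
    for r in range(R):            # cumulates give destination indices
        count[r + 1] += count[r]
    aux = [None] * len(a)
    for key in a:                 # move items, preserving input order (stable)
        aux[count[key]] = key
        count[key] += 1
    return aux                    # copy back would go here in an in-place sort

print(key_indexed_counting([2, 0, 3, 0, 1], 4))  # [0, 0, 1, 2, 3]
```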

## LSD Radix Sort

LSD (Least-Significant-Digit-first) string sort:

• Consider characters from right to left
• Stably sort using $$d^{th}$$ character as the key (key-indexed counting)

## MSD Radix Sort

MSD (Most-Significant-Digit-First) string sort:

• Partition array into $$R$$ pieces according to first character
• Recursively sort all strings that start with each character

Treat variable-length strings as if they had an extra char at end (smaller than any char)

### MSD String Sort vs. Quicksort

Disadvantages of MSD string sort:

• Extra space for aux[]
• Extra space for count[]
• Inner loop has a lot of instructions
• Accesses memory "randomly" (cache inefficient)

Disadvantage of quicksort:

• Linearithmic number of string compares (not linear)
• Has to rescan many characters in keys with long prefix matches

## 3-Way Radix Quicksort

Do 3-way partitioning on the $$d^{th}$$ character

• Less overhead than $$R$$-way partitioning in MSD string sort
• Does not re-examine characters equal to the partitioning char

### 3-Way String Quicksort vs. Standard Quicksort

Standard quicksort:

• Uses $$\sim 2 N \ln N$$ string compares on average
• Costly for keys with long common prefixes (and this is a common case)

3-way string quicksort:

• Uses $$\sim 2 N \ln N$$ character compares on average
• Avoids re-comparing long common prefixes

## Suffix Arrays

Problem: Given a text of $$N$$ characters, preprocess it to enable fast substring search (find all occurrences of a query string)

1. Suffix sort the text
2. Binary search for the query; scan until mismatch

# Tries

String symbol table API:

Goal: Faster than hashing, more flexible than BSTs.

## R-Way Tries

• Store characters in nodes (not keys)
• Each node has $$R$$ children, one for each possible character

Search: Follow links corresponding to each character in the key

• Search hit: node where search ends has a non-null value
• Search miss: reach null link or node where search ends has null value

Insertion: Follow links corresponding to each character in the key

• Encounter a null link: create new node
• Encounter the last character of the key: set value in that node

Deletion:

• Find the node corresponding to the key and set value to null
• If node has null value and all null links, remove that node (and recur)

Performance:

• Search hit: Need to examine all $$L$$ characters for equality
• Search miss: Examine only a few characters for typical case
• Space: $$R$$ null links at each leaf (sublinear if many short strings share common prefixes)
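A minimal R-way trie in Python, restricted to lowercase letters (R = 26) purely for brevity:

```python
class TrieST:
    """R-way trie: characters live on the links, values in the nodes."""
    R = 26  # lowercase alphabet (assumption for brevity)

    class Node:
        __slots__ = ("value", "next")
        def __init__(self):
            self.value = None
            self.next = [None] * TrieST.R

    def __init__(self):
        self.root = TrieST.Node()

    def put(self, key, value):
        node = self.root
        for ch in key:                       # follow/create one link per character
            c = ord(ch) - ord("a")
            if node.next[c] is None:         # null link: create a new node
                node.next[c] = TrieST.Node()
            node = node.next[c]
        node.value = value                   # set value in the last node

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.next[ord(ch) - ord("a")]
            if node is None:                 # search miss: null link
                return None
        return node.value                    # None here means miss at a non-key node

t = TrieST()
t.put("sea", 1); t.put("seashell", 2)
# t.get("sea") → 1, t.get("shell") → None
```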

## Ternary Search Tries

• Store characters and values in nodes (not keys)
• Each node has 3 children: smaller (left), equal (middle), larger (right)

Search: Follow links corresponding to each character in the key

• If less, take left link; if greater, take right link
• If equal, take the middle link and move to the next key character

Insertion: Follow links corresponding to each character in the key

• Encounter a null link: create new node
• Encounter the last character of the key: set value in that node

Deletion:

• Find the node corresponding to the key and set value to null
• If node has null value and all null links, remove that node (and recur)

### TST vs. Hashing

Hashing:

• Need to examine entire key
• Search hits and misses cost about the same
• Performance relies on hash function
• Does not support ordered symbol table operations

TST:

• Works only for strings (or digital keys)
• Only examines just enough key characters
• Search miss may involve only a few characters
• Support ordered symbol table operations (plus others)

# Substring Search

Find pattern of length $$M$$ in a text of length $$N$$. Typically $$N \gg M$$

Brute force: Check for pattern starting at each text position.

• We want linear-time guarantee
• We want to avoid backup: treat the input as a stream of data

## Knuth-Morris-Pratt

### DFA

DFA (Deterministic finite state automaton):

• Finite number of states (including start and halt)
• Exactly one transition for each char in alphabet
• Accept if sequence of transitions leads to halt state

The DFA state after reading in text[i] is the length of longest prefix of pattern[] that is a suffix of text[0:i]

Constructing the DFA:

• Copy dfa[][X] to dfa[][j] for the mismatch case
• Set dfa[pat.charAt(j)][j] to j+1 for match case
• Update X

### KMP

Once we have the dfa matrix, we can use it to make transitions without backup.
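The construction and the resulting backup-free search can be sketched in Python (the alphabet size R = 128 is an assumption):

```python
def build_kmp_dfa(pat, R=128):
    """dfa[c][j] = next state after reading character code c in state j."""
    m = len(pat)
    dfa = [[0] * m for _ in range(R)]
    dfa[ord(pat[0])][0] = 1
    x = 0                                 # restart state X for the mismatch copy
    for j in range(1, m):
        for c in range(R):
            dfa[c][j] = dfa[c][x]         # copy dfa[][X] to dfa[][j] (mismatch case)
        dfa[ord(pat[j])][j] = j + 1       # match case: advance to state j+1
        x = dfa[ord(pat[j])][x]           # update X
    return dfa

def kmp_search(pat, txt):
    """Return the index of the first match, scanning the text with no backup."""
    dfa, j, m = build_kmp_dfa(pat), 0, len(pat)
    for i, ch in enumerate(txt):          # one pass over the text stream
        j = dfa[ord(ch)][j]
        if j == m:                        # reached the halt state
            return i - m + 1
    return -1
```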

## Boyer-Moore

• Scan characters in pattern from right to left
• Can skip as many as $$M$$ text chars when finding one not in the pattern

Precompute the index of the rightmost occurrence of each character in the pattern to decide how far to skip.

## Rabin-Karp

Use modular hashing:

• Compute a hash of pattern characters $$0$$ to $$M-1$$
• For each $$i$$, compute a hash of text characters $$i$$ to $$M+i-1$$
• If pattern hash = text substring hash, check for a match

Modular hash function: use the notation $$t_i$$ for txt.charAt(i), compute $x_{i}=t_{i} R^{M-1}+t_{i+1} R^{M-2}+\ldots+t_{i+M-1} R^{0}(\bmod Q)$ Horner's method: linear-time method to evaluate degree-$$M$$ polynomial
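A Python sketch of the rolling hash (the tiny modulus Q = 997 is for illustration only; a real implementation would use a large random prime):

```python
def horner_hash(s, m, R=256, Q=997):
    """Hash of s[0:m] via Horner's method: linear in m."""
    h = 0
    for ch in s[:m]:
        h = (R * h + ord(ch)) % Q
    return h

def rabin_karp(pat, txt, R=256, Q=997):
    m = len(pat)
    pat_hash = horner_hash(pat, m)
    rm = pow(R, m - 1, Q)                 # R^(M-1) mod Q, used to drop the leading char
    h = horner_hash(txt, m)
    if h == pat_hash and txt[:m] == pat:  # Las Vegas: verify on hash match
        return 0
    for i in range(m, len(txt)):
        h = (h + Q - rm * ord(txt[i - m]) % Q) % Q   # remove leading character
        h = (h * R + ord(txt[i])) % Q                # add trailing character
        if h == pat_hash and txt[i - m + 1:i + 1] == pat:
            return i - m + 1
    return -1
```

The direct string compare on each hash match makes this the Las Vegas variant: always correct, fast in expectation.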


## HW9 Boggle

Specification

Find valid words in a board by connecting characters to their eight neighbors.

1. Add the dictionary to a 26-way trie
2. Use DFS to search the board; terminate early when no prefix matches
3. DFS and trie-node traversal can run together to avoid restarting from the root every time

Solution: BoggleSolver.java

# Regular Expressions

• Substring search: Find a single string in text
• Pattern matching: Find one of a specified set of strings in text

Regular Expression Pattern:

| operation | order | example RE | matches | does not match |
|---|---|---|---|---|
| concatenation | 3 | AABAAB | AABAAB | every other string |
| or | 4 | AA \| BAAB | AA, BAAB | every other string |
| closure | 2 | AB*A | AA, ABBBBBA | AB, ABABA |
| parentheses | 1 | A(A \| B)AAB | AAAAB, ABAAB | every other string |
| parentheses | 1 | (AB)*A | A, ABABABABA | AA, ABBA |
| wildcard | | .U.U.U. | CUMULUS, JUGULUM | SUCCUBUS, TUMULTUOUS |
| character class | | [A-Za-z][a-z]* | word, Capitalized, camelCase | 4illegal |
| at least 1 | | A(BC)+DE | ABCDE, ABCBCDE | ADE, BCDE |
| exactly k | | [0-9]{5}-[0-9]{4} | 08540-1321, 19072-5541 | 111111111, 166-54-111 |

## NFA

Kleene's theorem:

• For any DFA, there exists an RE that describes the same set of strings
• For any RE, there exists a DFA that recognizes the same set of strings

==The DFA built from an RE may have an exponential number of states, so instead we use an NFA (Nondeterministic finite state automaton).==

Regular-expression-matching NFA:

• RE enclosed in parentheses
• One state per RE character (start=$$0$$, accept=$$M$$)
• Red $$\epsilon$$-transitions (change state, but don't scan text)
• Black match transition (change state and scan to next text char)
• Accept if any sequence of transitions ends in the accept state

The nondeterminism comes from the multiple possible transitions; the program has to consider all possible transition sequences.

Basic plan:

• Build NFA from RE
• Simulate NFA with text as input

## NFA Simulation

NFA representation:

• State names: Integers from $$0$$ to $$M$$
• Match-transitions: Keep regular expression in array re[]
• $$\epsilon$$-transitions: Store in a digraph $$G$$

Simulation Steps:

1. Run DFS from each source, without unmarking vertices
2. Maintain set of all possible states that NFA could be in after reading in the first $$i$$ characters
3. When no more input characters, accept if any state in the set is an accept state

Time complexity: Determining whether an $$N$$-character text is recognized by the NFA corresponding to an $$M$$-character pattern takes time proportional to $$MN$$ in the worst case.

## NFA Construction

Construction process:

• States: Include a state for each symbol in the RE, plus an accept state.
• Concatenation: Add a match-transition edge from each state corresponding to an alphabet character to the next state.
• Parentheses: Add $$\epsilon$$-transition edge from parentheses to next state.
• Closure: Add three $$\epsilon$$-transition edges for each * operator.
• Or: Add two $$\epsilon$$-transition edges for each | operator

Time complexity: Building the NFA corresponding to an $$M$$-character RE takes time and space proportional to $$M$$

# Data Compression

Lossless compression and expansion:

• Message: Binary data $$B$$ we want to compress
• Compress: Generates a "compressed" representation $$C(B)$$
• Expand: Reconstructs original bitstream $$B$$

Static model: Same model for all texts

• Fast
• Not optimal: different texts have different statistical properties
• Ex: ASCII, Morse code

Dynamic model: Generate the model based on the text

• Preliminary pass needed to generate model
• Must transmit the model
• Ex: Huffman code

Adaptive model: Progressively learn and update model as you read text

• More accurate modeling produces better compression
• Decoding must start from beginning
• Ex: LZW

## Run-Length Encoding

Simple type of redundancy in a bitstream: Long runs of repeated bits

Representation: 4-bit counts represent alternating runs of 0s and 1s (e.g. 15 0s, 7 1s, 7 0s, 11 1s)
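A sketch of the encoder in Python; the convention of emitting a zero-length run of the opposite bit when a run overflows the 4-bit count is an assumption:

```python
def rle_encode(bits, width=4):
    """Alternating run lengths (starting with 0s), each capped at 2^width - 1."""
    max_run = (1 << width) - 1
    runs, current, run = [], "0", 0
    for b in bits:
        if b == current and run < max_run:
            run += 1
        elif b == current:              # overflow: emit max run, then a 0-length run
            runs += [run, 0]
            run = 1
        else:                           # bit flipped: close the current run
            runs.append(run)
            current, run = b, 1
    runs.append(run)
    return runs

print(rle_encode("0" * 15 + "1" * 7 + "0" * 7 + "1" * 11))
# [15, 7, 7, 11]
```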

## Huffman Compression

Variable-length codes: Use different number of bits to encode different chars.

To avoid ambiguity, ensure that no codeword is a prefix of another.

Use binary trie to represent the prefix-free code.

To find the best prefix-free code, use the Huffman algorithm:

• Count freq for each char in input
• Start with one node for each char with weight equal to freq
• Select two tries with min weight and merge into single trie with cumulative weight
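The merge loop can be sketched with a binary heap in Python; here each trie is represented directly as a partial code table rather than as explicit nodes:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Repeatedly merge the two minimum-weight tries into one."""
    # Heap entry: (weight, tiebreak, {char: codeword-so-far})
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)                      # unique ints avoid comparing dicts
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # two tries with minimum weight
        w2, _, right = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))  # cumulative weight
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
# The most frequent char gets the shortest codeword; no code prefixes another
```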

## LZW Compression

• Create ST associating $$W$$-bit codewords with string keys
• Initialize ST with codewords for single-char keys
• Find longest string $$s$$ in ST that is a prefix of unscanned part of input
• Write the $$W$$-bit codeword associated with $$s$$
• Add $$s+c$$ to ST, where $$c$$ is next char in the input

Use a trie that supports longest prefix match to represent the LZW compression code table.
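A sketch of the compression side in Python; for brevity a dict plays the role of the trie-based code table (the codeword width W = 12 and single-byte input chars are assumptions):

```python
def lzw_compress(text, W=12):
    """Emit codewords for the longest table prefixes of the input."""
    table = {chr(i): i for i in range(256)}   # codewords for single-char keys
    next_code, out, s = 256, [], ""
    for c in text:
        if s + c in table:                    # extend the longest prefix match
            s += c
        else:
            out.append(table[s])              # write the codeword associated with s
            if next_code < (1 << W):          # add s + c, c = next char in the input
                table[s + c] = next_code
                next_code += 1
            s = c
    if s:
        out.append(table[s])
    return out

print(lzw_compress("ABABABA"))
# [65, 66, 256, 258]
```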

## HW10 Burrows–Wheeler

Specification

Implement the Burrows-Wheeler data compression algorithm

1. Burrows–Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times. BurrowsWheeler.java, CircularSuffixArray.java
2. Move-to-front encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear much more frequently than others. MoveToFront.java
3. Huffman compression. Given a text file in which certain characters appear much more frequently than others, compress it by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.
]]>
<p>Review of Princeton <a href="https://www.coursera.org/learn/algorithms-part2">Algorithms II</a> course on Coursera.</p> <p><a href="https://silencial.github.io/algorithms-1/">Algorithms I Review</a></p> <p>My <a href="https://github.com/silencial/Algorithms">solution</a> to the homework</p>
CS Tools https://silencial.github.io/cs-tools/ 2020-03-27T00:00:00.000Z 2021-04-30T00:00:00.000Z An overview of commonly used CS tools, following The Missing Semester of Your CS Education.

# Shell

## Bash

Bash scripting tutorial

Minimal safe Bash script template

### Basics

• shebang: the first line of a script, indicating what should execute it, e.g. #!/usr/bin/env bash or #!/usr/bin/env python. Using env searches the PATH environment variable automatically.
• Variable assignment: foo=bar. ==Note that foo = bar will not work, because spaces act as argument separators==
• Strings can be written with '' or "", but "" substitutes variables: echo "$foo" prints bar. ==Single quotes keep every inner character literal==
• A script returns a value when it finishes; 0 means success
• Wildcards:
  • ? matches one character
  • * matches any number of characters
  • [] matches any single character inside the brackets
  • {} expands to every alternative inside the braces, e.g. {a,b}/c expands to a/c b/c
  • .. is a range, e.g. {1..3} expands to 1 2 3
• Use [[ ]] when doing comparisons (see the detailed command reference)
• Combining commands:
  • CMD1; CMD2: run in sequence
  • CMD1 && CMD2: run 2 only if 1 succeeded
  • CMD1 || CMD2: run 2 only if 1 failed
  • CMD1 | CMD2: the output of 1 becomes the input of 2
• Using command results as values:
  • command substitution: for file in $(ls)
  • process substitution: <(CMD) runs CMD, saves the output to a temporary file, and substitutes the file name

### Special Characters

• $0: script name
• $1 to $9: arguments passed to the script; $1 is the first argument, and so on
• $@: all arguments
• $#: number of arguments
• $?: return value of the previous command
• $$: process ID of the current script
• !!: the previous command, e.g. sudo !!
• $_: the last argument of the previous command; in an interactive shell it can also be entered with Esc + .

### Shortcuts

• ctrl+u: clear the current line
• ctrl+a: move to the beginning of the line
• ctrl+e: move to the end of the line
• alt+f: move forward one word
• alt+b: move backward one word
• ctrl+k: delete from the cursor to the end of the line
• alt+backspace: delete from the cursor to the beginning of the word
• alt+d: delete from the cursor to the end of the word
• ctrl+c: cancel the current command

• ctrl+c: send SIGINT to the process, which does not necessarily stop it
• ctrl+\: send SIGQUIT to the process, terminating it
• ctrl+z: send SIGSTOP to the process, suspending it
• Closing the terminal window sends SIGHUP, terminating the process

## Zsh

• cd can be omitted
• Jump up n directories quickly: type n+1 dots
• Process completion: type kill process_name and press Tab to substitute the process id
• d lists recently visited directories; enter the corresponding number to jump to one
• To enter ~/workspace/src/dict, type cd ~/w/s/d and press Tab to complete
• Repeat the last command quickly: r
• Prefix-restricted history search: type git, then press ↑ to search through previously used git commands
• Recursive globbing: ls **/*.png finds all .png files under the current directory

## Common Commands

### File Handling

ls [options] [file or directory]; l='ls -lah' lists all files in detail

du -h -d 0 [directory] shows a folder's size; -d stands for depth

mkdir -p [directory] creates directories recursively; md='mkdir -p'

cd [directory] enters a directory

rm -rf [file or directory] force-deletes a file or directory; -r recursively deletes directories

cp [options] [source] [destination] copies

• -r copy directories
• -p copy file attributes as well
• -d if the source is a symlink, copy the link itself
• -a equivalent to -rpd

mv [source] [destination] moves or renames

x=extract unpacks archives

### Searching

locate [filename] searches by file name in a background database (updated once a day by default), which is faster.

• updatedb forces a database update

• /etc/updatedb.conf configures the search conditions

find [scope] [conditions] uses wildcards and matches exactly

• -name search by name
• -iname ignore case

whereis [command] finds a command's path and the location of its documentation

• -b find only the executable
• -m find only the documentation

which [command] finds the location of the command the current environment would invoke, showing aliases

grep [options] string file matches strings inside files with regular expressions (substring matching); it can also search command output

• -i ignore case
• -v exclude the given string
• -C print context, e.g. grep -C 2 prints 2 lines above and below each match.

### Jobs

kill [PID]: terminate a process

pkill [NAME] terminates processes by name, with Tab completion

fg/bg: resume a suspended process in the foreground/background

jobs: show unfinished jobs

### Miscellaneous

man [command]: command documentation

systemctl suspend: suspend the machine

cat [file] | xclip -selection clipboard: copy file contents to the system clipboard

sips --resampleWidth []: scripted image processing, macOS only

# SSH

On Unix systems, ssh is split into openssh-client and openssh-server. The former ships by default and allows connecting to other machines.

• Ubuntu: install with sudo apt-get install openssh-server
• Mac: turn on Sharing -> Remote Login

The server configuration is in /etc/ssh/sshd_config, where the port and other settings can be changed; the default port is 22

• Be clear whether an ip address is private or public. Devices connected to the same router form a LAN, and the private ip addresses assigned by the router only allow connections inside that LAN
• Check whether the ip address assigned to the router is actually public; ==it may belong to a larger carrier network==
• To connect from outside with a dynamic ip address, you need to buy DDNS (dynamic DNS) resolution
• In the US, ip addresses basically never change

# Vim

## Modes

• Normal: move the cursor. Switch with Esc
• Insert: insert text. Switch with i
• Replace: replace text. Switch with R
• Visual: select text. Switch with v
• Command-line: run commands. Switch with :

## Vim Arguments

• vim + [file] jump to the last line
• vim +3 [file] jump to line 3
• vim +/xxx [file] jump to the first occurrence of xxx; n jumps to the next
• vim [file1] [file2] [file3] open several files; :n switches to the next file, :N or :prev to the previous, :b [tab] completes file names
• vim -p [file1] [file2] [file3] open files in tabs; gt/gT switches to the next/previous tab

## Commands

• :w save
• :q quit
• :wq save and quit
• :x save (only if the file was modified) and quit
• :e [file] open a file
• :[command]! force the command
• :ls list all open files
• /xxx jump to the first occurrence of xxx
• ?xxx search backwards
• :h [topic] open the help for a command or key
• :w [name] save the current file under another name; with a Visual selection, save only the selection
• :r [name] insert the file contents below the cursor; it can also read command output, e.g. :r !ls

## Movement

• h/j/k/l move the cursor left/down/right/up
• w/b/e next word/word beginning/word end
• 0/^/$ beginning of line/first non-blank character of line/end of line
• H/M/L top/middle/bottom of the screen
• C+b/f page up/down
• C+u/d half a page up/down
• gg/G beginning/end of file
• :15 go to line 15
• 15G go to line 15
• /[regex] search; n/N finds the next/previous match
• f/F find forward/backward; the cursor moves onto the target
• t/T find forward/backward; the cursor moves to just before the target
• m[X] set a mark named X that can be jumped to later
• ctrl+o/i go back to the previous/next location
• % jump to the matching bracket

In Visual mode, use movements to make the selection.

## Editing

• o/O insert on the line below/above
• > indent. >> indents the current line
• d[motion] delete. dw deletes a word, d$ deletes to the end of the line, d0 to the beginning, dd the whole line
• c[motion] change. cw changes a word. Equivalent to d[motion] + i
• Prefer text objects over motions: iw: inner word; it: inner tag; i": inner quotes; ip: inner paragraph; as: a sentence
• x delete a character. Equivalent to dl
• s substitute a character. Equivalent to xi
• :s/[a]/[b] replace the first match of a on the current line with b. :s/a/b/g replaces all matches on the line
• :[#],[#]s/[a]/[b]/g replace matches between the given line numbers
• :%s/[a]/[b]/g replace matches in the whole file
• :%s/[a]/[b]/gc confirm each replacement
• r replace a character without entering insert mode
• R enter replace mode
• yy yank a line, yw yank a word
• p/P paste below/above the current line
• u/U/ctrl+r undo/undo line/redo
• In Visual mode, d deletes the selection and c changes it

## Special

• . repeat the last operation
• Searches combine with c/d, e.g. dt" deletes up to the first " on the line
• A count before a command repeats it, e.g. 3w moves forward three words
• Deleted text is automatically placed in a register
• ctrl+g show the current file path and status
• Prefix an option with no to disable it, e.g. :set noic
• In command-line mode, ctrl+d shows completion options
• :so % sources the current file as a config file; :so ~/.vimrc sources the given file

# Tmux

• tmux start a new session
• tmux new -s [NAME] start a new named session
• tmux new -A attach to the last session, creating it if necessary
• tmux ls list all sessions
• <C-b> d detach the current session (the window closes, but the process keeps running in the background)
• tmux a attach to the last session
• tmux a -t [name] attach to the session with the given name

• <C-b> c create a window
• <C-d> close the window
• <C-b> [N] go to window N
• <C-b> p/n go to the previous/next window
• <C-b> , rename the current window
• <C-b> w list all windows

• <C-b> " split the pane horizontally
• <C-b> % split the pane vertically
• <C-b> [arrow key] move to the pane in that direction
• <C-b> z toggle full screen for the current pane
• <C-b> <space> cycle pane layouts

# Debugging & Profiling

## Debugging

ipdb: a debugger for Python, the IPython version of pdb, with completion, highlighting, and more.

python -m ipdb [file]

• l(ist): show the 11 lines around the current line; can be repeated
• s(tep): execute the current line
• c(ontinue): run until a breakpoint or an error
• b(reak): set a breakpoint
• p(rint): print a variable in the current scope
• r(eturn): continue until the current function returns
• q(uit): quit

## Profiling

time: measure how long a command takes

• real: total wall-clock time
• user: CPU time spent in user space
• sys: CPU time spent in kernel space

In Python, the line_profiler module can time every line and memory_profiler can measure memory usage

# Make

Make is the most common build system. Any project that needs rebuilding whenever files change can be built with Make.

## Makefile

A sample Makefile (note that the second line must start with a tab):

• A dependency can have its own build rule, forming a dependency graph
• Besides files, a target can also be a command, called a phony target. To avoid confusion with a file of the same name in the current directory, declare it under .PHONY at the top

## Syntax

• $@: the target
• $^: all dependencies
• $<: the first dependency

Pattern matching and built-in functions:

• wildcard: glob matching
• patsubst [pattern],[replacement],[text]: pattern-matched substitution

## Make Commands

Running make looks for a Makefile in the current directory and executes a command only when the target does not exist or is older than its dependencies.

• make [target]: build the given target; without one, the first target is built
• make -n [target]: dry-run
• make -B [target]: build the target unconditionally
• make -C [dir] [target]: change the directory make runs in
• make -f [file] [target]: use the given makefile

# C++

## GCC

GCC (GNU Compiler Collection) is the compiler for C. Compiling C source code has four steps: preprocessing -> compiling -> assembling -> linking

1. Preprocessing: the compiler pulls in the header files included by the C source. gcc -E hello.c -o hello.i
2. Compiling: gcc checks the code for conformance and syntax errors to determine what it actually does, then translates it into assembly. gcc -S hello.i -o hello.s
3. Assembling: the .s file from the compile stage is turned into binary object code.
4. Linking: the executable is produced

• -E run only the preprocessor
• -S translate C code into assembly
• -c compile only, without linking
• -o name the output file
• -Wall show warnings

## CMakeLists.txt

CMake sits above Makefiles; its goal is to generate portable Makefiles automatically. In a CMake configuration:

• a.cpp is compiled into an executable (running main)
• b.cpp is compiled into a library, static or shared
• c.cpp calls functions from b.cpp, so it must link against b's library. A header b.h declaring b's functions is needed, used from c.cpp with include "b.h"

## Building and Installing

Keep sources and generated files apart by building in a build folder. make install requires an install command defined in CMakeLists.txt.

A third-party library's install command typically puts headers, libraries, and executables under the include/, lib/, and bin/ folders of CMAKE_INSTALL_PREFIX (default /usr/local). Afterwards the package can be pulled in conveniently with find_package in CMakeLists.txt.

Uninstall with make uninstall, but the author may not have defined an uninstall command. In that case, the folder where make install ran contains an install_manifest.txt listing the paths of all installed files; remove them with xargs rm < install_manifest.txt.

The install location can be changed with cmake -DCMAKE_INSTALL_PREFIX=<path>, which makes packages easier to manage.

### CMake Package Registry

If find_package can find a package that was only built with make but never installed, the original CMakeLists.txt contains an export statement, which registers the path through the User Package Registry. The files live in ~/.cmake/packages/<package>; they can have any name, and their content is the search path.

## find_package

Imports a third-party library.

The statement mainly looks for a <package>Config.cmake file under the addresses defined by a set of variables, or through the CMake Package Registry described above. See the Search Procedure for the complete lookup order.

Using find_package generally defines some variables about the package for later use.

# Q & A

## source & exec & ./script

• source runs the commands in the current shell
• exec terminates the current shell and starts a new process to run the command
• ./script spawns a new shell environment, runs the commands in it, then exits

Use exec zsh rather than source ~/.zshrc to reload the .zshrc file

## Some Commands Differ Between Mac and Linux
The commands on Linux are the GNU versions. On a Mac they can be installed with brew install coreutils; the installed commands live in /usr/local/bin with a g prefix. To replace the built-in Mac commands, simply create symlinks in that directory (by default, /usr/local/bin comes before /usr/bin in PATH)

## Folder/File Naming Rules

Never, never, never use spaces; they cause endless trouble on the command line. If spaces are unavoidable and only a few files or folders are affected, create well-named symlinks at the same location.

# Trivia

• The rc in .zshrc stands for run command

]]> <p>An overview of commonly used CS tools, following <a href="https://missing.csail.mit.edu/">The Missing Semester of Your CS Education</a>.</p> <p>Other related blogs:</p> <ul> <li><a href="https://www.notion.so/silencial/6dedf8a81bf840528e2b442477a3928c">Common Software</a>: personal Notion page</li> <li><a href="https://silencial.github.io/setup/">Setup</a>: my tools and configuration</li> <li><a href="https://silencial.github.io/git/">Git 101</a>: common git commands</li> <li><a href="https://silencial.github.io/regex/">Regular Expression</a>: common regular expression usage</li> <li><a href="https://silencial.github.io/linux/">Linux</a>: introduction to Linux basics and operations</li> </ul> Mobile Robots https://silencial.github.io/mobile-robots/ 2020-03-19T00:00:00.000Z 2020-03-20T00:00:00.000Z CSE 490R Review.

My solution to the homework

# Lab0

• Localization: Determine the pose of a robot relative to a given map of the environment.
• Mapping: Construct a map.
• Planning:

## Bayes Filter

The Bayes filter algorithm includes two steps: prediction (push the belief through the dynamics given the action) and correction (apply Bayes rule given the measurement). $\overline{bel}\left(x_{t}\right) = \int p\left(x_{t} | u_{t}, x_{t-1}\right) bel\left(x_{t-1}\right) d x_{t-1} \\bel\left(x_{t}\right) = \eta p\left(z_{t} | x_{t}\right) \overline{bel}\left(x_{t}\right)$

## Kalman Filter

Assumptions:

1. Linear dynamics $$p(x_t | u_t, x_{t-1}) = A_{t} x_{t-1}+B_{t} u_{t}+\varepsilon_{t}$$, where $$\varepsilon_t \sim \mathcal{N}(0, R_t)$$
2. Linear measurement model $$p(z_t | x_t) = C_{t} x_{t}+\delta_{t}$$, where $$\delta_t \sim \mathcal{N}(0, Q_t)$$
3.
Initial belief $$bel(x_0)$$ is a Gaussian distribution $bel\left(x_{0}\right)=p\left(x_{0}\right)=\operatorname{det}\left(2 \pi \Sigma_{0}\right)^{-\frac{1}{2}} \exp \left\{-\frac{1}{2}\left(x_{0}-\mu_{0}\right)^{T} \Sigma_{0}^{-1}\left(x_{0}-\mu_{0}\right)\right\}$

Algorithm: \mathbf{\text{Algorithm Kalman_filter(}}\mu_{t-1}, \Sigma_{t-1}, u_t, z_t \mathbf{):} \\\begin{aligned}\bar{\mu}_{t} &=A_{t} \mu_{t-1}+B_{t} u_{t} \\\bar{\Sigma}_{t} &=A_{t} \Sigma_{t-1} A_{t}^{T}+R_{t} \\K_{t} &=\bar{\Sigma}_{t} C_{t}^{T}\left(C_{t} \bar{\Sigma}_{t} C_{t}^{T}+Q_{t}\right)^{-1} \\\mu_{t} &=\bar{\mu}_{t}+K_{t}\left(z_{t}-C_{t} \bar{\mu}_{t}\right) \\\Sigma_{t} &=\left(I-K_{t} C_{t}\right) \bar{\Sigma}_{t} \\&\text{return } \mu_{t}, \Sigma_{t}\end{aligned}

# Lab1

## Motion Model

• Kinematic model: map wheel speeds to robot velocities.
• Dynamic model: map wheel torques to robot accelerations.

Consider only the kinematic model $$p(x_t | u_t, x_{t-1})$$ for now (assume we can set the speed directly); then $$x = [x, y, \theta]^T$$, $$u = [V, \delta]^T$$, where $$\theta$$ is the heading and $$\delta$$ is the steering angle.

We now derive the motion model for the rear axle. Note that a rigid body undergoing rotation and translation can be viewed as a pure rotation about an instant center of rotation: \begin{aligned}&\dot{x}=V \cos (\theta)\\&\dot{y}=V \sin (\theta)\\&\dot{\theta}=\omega=\frac{V \tan \delta}{L}\end{aligned}

With numerical integration we get $x_{t+1}=x_{t}+\frac{L}{\tan (\delta)}\left(\sin \left(\theta_{t+1}\right)-\sin \left(\theta_{t}\right)\right) \\y_{t+1}=y_{t}+\frac{L}{\tan (\delta)}\left(-\cos \left(\theta_{t+1}\right)+\cos \left(\theta_{t}\right)\right) \\\theta_{t+1}=\theta_{t}+\frac{V}{L} \tan (\delta) \Delta t$

### Noise

1. Control signal error: $$\hat{V} \sim \mathcal{N}(V, \sigma_v^2)$$, $$\hat{\delta} \sim \mathcal{N}(\delta, \sigma_\delta^2)$$
2.
Incorrect physics (the kinematic model is inaccurate): $$\hat{x} \sim \mathcal{N}\left(x, \sigma_{x}^{2}\right)$$, $$\hat{y} \sim \mathcal{N}\left(y, \sigma_{y}^{2}\right)$$, $$\hat{\theta} \sim \mathcal{N}\left(\theta, \sigma_{\theta}^{2}\right)$$

## Sensor Model

$$p(z_t | x_t, m)$$ is the probability of sensor reading $$z_t$$ given state $$x_t$$ and map $$m$$. Calculate the simulated sensor reading $$z^*$$ from $$x$$ and $$m$$ and then compare with $$z$$. Assume individual beams are conditionally independent given the map ==(may result in overconfidence problem)==: $p\left(z_{t} | x_{t}, m\right)=\prod_{k=1}^{K} p\left(z_{t}^{k} | x_{t}, m\right)$

### Noise

1. Simple measurement noise in the distance value $p_{\text {hit }}\left(z_{t}^{k} | x_{t}, m\right)=\begin{cases} \eta \mathcal{N}\left(z_{t}^{k} ; z_{t}^{k *}, \sigma_{\text {hit }}^{2}\right) & \text { if } 0 \leq z_{t}^{k} \leq z_{\max } \\ 0 & \text { otherwise } \end{cases}$
2. Presence of unexpected objects $p_{\text {short }}\left(z_{t}^{k} | x_{t}, m\right)=\begin{cases} \eta \lambda_{\text {short }} e^{-\lambda_{\text {short }} z_{t}^{k}} & \text { if } 0 \leq z_{t}^{k} \leq z_{t}^{k *} \\ 0 & \text { otherwise } \end{cases}$
3. Laser returns max range when there is no object $p_{\max }\left(z_{t}^{k} | x_{t}, m\right)=I\left(z=z_{\max }\right)=\begin{cases} 1 & \text { if } z=z_{\max } \\ 0 & \text { otherwise } \end{cases}$
4.
Failures in sensing $p_{\text {rand }}\left(z_{t}^{k} | x_{t}, m\right)=\begin{cases} \frac{1}{z_{\max }} & \text { if } 0 \leq z_{t}^{k}<z_{\max } \\ 0 & \text { otherwise } \end{cases}$

Combine these 4 models to get $p\left(z_{t}^{k} | x_{t}, m\right)=\begin{pmatrix}z_{\mathrm{hit}} \\z_{\text {short }} \\z_{\text {max }} \\z_{\text {rand }}\end{pmatrix}^{T} \begin{pmatrix}p_{\text {hit }}(z_{t}^{k} | x_{t}, m) \\p_{\text {short }}(z_{t}^{k} | x_{t}, m) \\p_{\text {max }}(z_{t}^{k} | x_{t}, m) \\p_{\text {rand }}(z_{t}^{k} | x_{t}, m)\end{pmatrix}$

## Particle Filter

The particle filter is an alternative nonparametric implementation of the Bayes filter. The key idea is to represent the posterior $$bel(x_t)$$ by a set of random state samples (particles) drawn from this posterior. \mathbf{\text{Algorithm Particle_filter(}} \mathcal{X}_{t-1}, u_t, z_t \mathbf{):} \\\begin{aligned}&\bar{\mathcal{X}}_{t}=\mathcal{X}_{t}=\emptyset \\&\text{for } m=1 \text{ to } M \text{ do} \\&\qquad \text{sample } x_{t}^{[m]} \sim p\left(x_{t} | u_{t}, x_{t-1}^{[m]}\right) \\&\qquad {w_{t}^{[m]}=p\left(z_{t} | x_{t}^{[m]}\right)} \\&\qquad {\bar{\mathcal{X}}_{t}=\bar{\mathcal{X}}_{t}+\left\langle x_{t}^{[m]}, w_{t}^{[m]}\right\rangle} \\&\text{endfor } \\&\text{for } m = 1 \text{ to } M \text{ do} \\&\qquad \text{draw } i \text{ with probability } \propto w_{t}^{[i]} \\&\qquad \text{add } x_t^{[i]} \text{ to } \mathcal{X}_t \\&\text{endfor} \\&\text{return } \mathcal{X}_t\end{aligned}

We first sample $$x_t$$ from the state transition distribution, then calculate the importance factor $$w_t$$. In the second loop we resample (importance sampling) to change the distribution of particles from $$\overline{bel}(x_t)$$ to $$bel(x_t)$$.

### Resample

Resampling can cause a high-variance (low-entropy) problem, where particles are depleted. Possible fixes:

1. If the variance of the weights is low, don't resample.
2. Use low-variance sampling.
\mathbf{\text{Algorithm Low_variance_sampler(}} \mathcal{X}_{t}, \mathcal{W}_{t} \mathbf{):} \\\begin{aligned}&\bar{X}_{t}=\emptyset \\&r=\operatorname{rand}\left(0 ; M^{-1}\right) \\&c=w_{t}^{[1]} \\&i=1 \\&\text{for } m=1 \text{ to } M \text{ do} \\&\qquad U=r+(m-1) \cdot M^{-1} \\&\qquad \text{while } U > c \\&\qquad\quad i=i+1 \\&\qquad\quad c=c+w_{t}^{[i]} \\&\qquad \text{endwhile} \\&\qquad \text{add } x_{t}^{[i]} \text{ to } \bar{X}_{t} \\&\text{endfor } \\&\text{return } \bar{\mathcal{X}}_t\end{aligned}

### Expected Pose

Calculate the expected pose from the particles: $$x$$ and $$y$$ can be computed directly by the weighted average. However, the weighted average of $$\theta$$ is not accurate, so we use cosine and sine averaging instead. (Ref)

## Code

MotionModel.py:

1. Subscribe to the motor topic, get the control info (speed and steering angle), and save the last frame info for the calculation.
2. Add variance to the speed and steering angle. Apply the motion model to the particles. Add variance to the states.

SensorModel.py:

1. Precompute the sensor model table and use the range_libc package for 2D raycasting on a 2D occupancy grid.
2. Subscribe to the scan topic, filter out invalid and extreme scan values, downsample, and pass the processed data on to update the weights.

ParticleFilter.py:

1. Transfer the map to an occupancy grid.
2. Globally initialize particles and weights. Initialize the motion model and sensor model.
3. Subscribe to the /initialpose topic, initialize particles and weights around the initial pose.
4. When scan info arrives, update the weights, resample, and calculate the expected pose for visualization.
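The low-variance sampler above can be sketched in Python (the weights are assumed normalized):

```python
import random

def low_variance_resample(particles, weights):
    """Systematic resampling: one random offset, M evenly spaced pointers."""
    M = len(particles)
    r = random.uniform(0, 1 / M)        # single random number r in [0, 1/M)
    c = weights[0]                      # running cumulative weight
    i = 0
    resampled = []
    for m in range(M):
        U = r + m / M                   # evenly spaced sampling points
        while U > c:                    # advance to the particle covering U
            i += 1
            c += weights[i]
        resampled.append(particles[i])
    return resampled
```

Because the M pointers are evenly spaced, a particle with weight w is selected either floor(wM) or ceil(wM) times, which keeps the resampling variance low.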
# Lab2

| | Uses Model | Stability Guarantee | Minimize Cost |
|---|---|---|---|
| PID | No | No | No |
| Pure Pursuit | Circular arcs | Yes (with assumptions) | No |
| Lyapunov | Non-linear | Yes | No |
| LQR | Linear | Yes | Quadratic |
| iLQR | Non-linear | Yes | Yes |

## Control

Position error in frame A: ${}^{A}e = \begin{bmatrix}x \\ y\end{bmatrix} - \begin{bmatrix}x_{ref} \\ y_{ref}\end{bmatrix}$ We want the position error in frame B so that the $$x$$ and $$y$$ errors correspond to the along-track and cross-track error respectively: \begin{aligned}{}^{B}e &= {}_A^{B} R {}^{A}e \\&=R(-\theta_{ref}) \left( \begin{bmatrix}x \\ y\end{bmatrix} - \begin{bmatrix}x_{ref} \\ y_{ref}\end{bmatrix} \right) \\&= \begin{bmatrix}\cos \left(\theta_{r e f}\right)\left(x-x_{r e f}\right)+\sin \left(\theta_{r e f}\right)\left(y-y_{r e f}\right) \\-\sin \left(\theta_{r e f}\right)\left(x-x_{r e f}\right)+\cos \left(\theta_{r e f}\right)\left(y-y_{r e f}\right)\end{bmatrix} \\&= \begin{bmatrix}e_{at} & e_{ct}\end{bmatrix}^T\end{aligned}

Only consider the cross-track error $$e_{ct}$$ and control the steering angle for now.

## PID

$u=-\left(K_{p} e_{c t}+K_{i} \int e_{c t}(t) d t+K_{d} \dot{e}_{c t}\right)$ where

• $$K_p$$: Proportional coefficient
• $$K_i$$: Integral coefficient
• $$K_d$$: Derivative coefficient

Analytically compute the derivative term: \begin{aligned}\dot{e}_{c t} &=-\sin \left(\theta_{ref}\right) \dot{x}+\cos \left(\theta_{ref}\right) \dot{y} \\&=-\sin \left(\theta_{ref}\right) V \cos (\theta)+\cos \left(\theta_{ref}\right) V \sin (\theta) \\&=V \sin \left(\theta-\theta_{ref}\right) \\&=V \sin \left(\theta_{e}\right)\end{aligned}

## Pure-Pursuit

Key idea: The car is always moving in a circular arc.

1. Find a lookahead and compute the arc
2. Move along the arc
3. Go to step 1

Solve for the arc: $\alpha=\tan ^{-1}\left(\frac{y_{ref}-y}{x_{ref}-x}\right)-\theta \\R = \frac{L}{2 \sin\alpha} \\\dot{\theta} = \frac{V}{R} = \frac{2V\sin\alpha}{L} \\\dot{\theta} = \frac{V\tan u}{B}$ where $$B$$ is the car length.
Solving for $$u$$ we get $u = \tan^{-1}\left( \frac{2B \sin \alpha}{L} \right)$

## Lyapunov Control

Define the Lyapunov function: $V\left(e_{ct}, \theta_{e}\right)=\frac{1}{2} k_{1} e_{ct}^{2}+\frac{1}{2} \theta_{e}^{2}$ Compute its derivative: \begin{aligned}\dot{V}\left(e_{c t}, \theta_{e}\right)&=k_{1} e_{c t} \dot{e}_{c t}+\theta_{e} \dot{\theta}_{e}\\&=k_{1} e_{c t} V \sin \theta_{e}+\theta_{e} \frac{V}{B} \tan u\end{aligned}

Set $$u$$ to get $$\dot{V} < 0$$: $\theta_{e} \frac{V}{B} \tan u=-k_{1} e_{c t} V \sin \theta_{e}-k_{2} \theta_{e}^{2} \\\tan u=-\frac{k_{1} e_{c t} B}{\theta_{e}} \sin \theta_{e}-\frac{B}{V} k_{2} \theta_{e} \\u=\tan ^{-1}\left(-\frac{k_{1} e_{c t} B}{\theta_{e}} \sin \theta_{e}-\frac{B}{V} k_{2} \theta_{e}\right)$

## LQR

Turn the problem into an optimization that trades off driving error against keeping the control action small: $\min _{u(t)} \int_{0}^{\infty}\left(w_{1} e(t)^{2}+w_{2} u(t)^{2} \right) dt$ Given

1. Linear dynamic system $x_{t+1} = Ax_t + Bu_t$
2. Quadratic cost $J=\sum_{t=0}^{T-1} x_{t}^{T} Q x_{t}+u_{t}^{T} R u_{t}$

the optimal control sequence minimizing the cost is $u_t = K_t x_t \\K_{t}=-\left(R+B^{T} V_{t+1} B\right)^{-1} B^{T} V_{t+1} A \\V_{t}=Q+K_{t}^{T} R K_{t}+\left(A+B K_{t}\right)^{T} V_{t+1}\left(A+B K_{t}\right)$

## MPC

1. Plan a sequence of control actions
2. Predict the set of next states to a horizon H
3. Evaluate the cost/constraints of the states and controls
4. Optimize the cost

## Code

controlnode.py: main script

1. Define subscribers to pose and path info, publishers for visualization, and services for reset
2. Enter the main program when the initial pose is set and the controller is ready. Get the pose and reference pose, get the next control, and publish to the /vesc/high_level/ackermann_cmd_mux/input/nav_0 topic.
3. Stop when the path is completed

runner_script.py:

1. Load different paths and speeds
2. Start the controller

controller.py: base controller class

1. Store path info
2.
Define utility functions: get the reference pose by index, get the error between pose and pose_ref. ==The find-reference-pose function is the same for all controllers: find the nearest point on the path and look ahead some distance.==

pid.py:

1. Use the PD equation to get the next control (steering angle)

purepursuit.py:

1. Use the pure pursuit equation to get the next control

mpc.py:

1. Divide the steering angle range $$[-\pi, \pi]$$ equally into $$K$$ rollouts
2. Execute each steering angle through $$T$$ timesteps to collect $$K * T$$ poses
3. Evaluate the cost of each rollout by collision cost and error cost
   1. Collision cost: if any pose in a trajectory is in collision with the map, add a large cost
   2. Error cost: norm of the distance between the last pose and the reference pose, weighted by a constant
4. Choose the rollout with the minimal cost and execute its first step
5. Return to step 1

mpc2.py: similar to mpc.py, but uses scan info rather than the map. The only change is the obstacle cost.

1. Calculate $$N$$ obstacle poses in the map from the scan info
2. For the $$K*T$$ poses, calculate the distance to every obstacle to get a $$K*T*N$$ array
3. Find the minimal distance for every pose to get a $$K*T$$ array
4. Average over timesteps to get a $$K$$ array, then weight by a constant to get the obstacle cost

nonlinear.py:

1. Use the Lyapunov control equation to get the control

# Lab3

Steps for planning a path from one point to another:

1. Randomly sample points on the map and construct a graph
2. Use a planning algorithm (A*) to search the graph for the optimal path

## Graph Construction

The graph we construct is called a random geometric graph (RGG):

1. Sample a set of collision-free vertices $$V$$
2. Connect neighboring vertices to get edges $$E$$

### Sampling

Uniform random sampling tends to clump. We want points to be spread out evenly, which can be achieved with a Halton sequence

### Optimal Radius

We want to choose the radius small enough for efficiency while still ensuring connectivity.
The optimal value can be chosen as $r = \left(\frac{\ln|V|}{\alpha_{p,d} |V|}\right)^{1/n}$ where $$\alpha_{p,d}$$ is a constant. For the special case of a two-dimensional space and the Euclidean norm ($$d=2$$, $$p=2$$), $$\alpha_{p,d} = \pi$$ ### Dubins Path Since we are considering 2-D car dynamics, we need to connect two points with a feasible path instead of a straight line. Mathematically, we need to solve the BVP: $\dot{q}(t) = f(q(t), u(t)) \\q(0) = q_1,\quad q(T) = q_2$ where $$q=(x,y,\theta)$$ in our case. Dubins showed that a solution always exists and must belong to one of 6 classes: $\{LRL, RLR, LSL, LSR, RSL, RSR\}$ where $$L,R,S$$ represent turning left, turning right, and going straight, respectively. ## Planning Algorithm Given start node $$s_0$$, goal $$s_1$$ and cost $$c(s, s')$$, create objects: 1. OPEN: priority queue of nodes to be processed 2. CLOSED: list of nodes already processed 3. $$g(s)$$: estimate of the least cost from start to a given node The pseudocode for best-first search can be expressed as 1. Push $$s_0$$ into OPEN 2. While $$s_1$$ not expanded 1. Pop best from OPEN 2. Add best to CLOSED 3. For every successor s' 1. If $$g(s') > g(s) + c(s, s')$$ 1. $$g(s') = g(s) + c(s, s')$$ 2. Add (update) $$s'$$ to OPEN The main problem is how to choose the priority function $$f(s)$$ used when popping the best node from OPEN. ### Dijkstra's Algorithm Choose $$f(s) = g(s)$$. Always pop the node with the smallest cost from the origin first. ### A* If we can pre-evaluate the cost from the node to the goal $$h(s)$$, then we can choose a better $$f(s) = g(s) + h(s)$$. • If $$h(s)$$ is admissible $$h(s) \le h^*(s)$$, $$h(goal) = 0$$, then the path returned by A* is optimal. • If $$h(s)$$ is consistent $$h(s) \le c(s, s') + h(s')$$, $$h(goal) = 0$$, then A* is optimal and efficient (will not re-expand a node). All consistent heuristics are admissible, but not vice versa. ### Weighted A* Choose $$f(s) = g(s) + \epsilon h(s)$$, where $$\epsilon > 1$$. 
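The best-first loop with $$f(s) = g(s) + h(s)$$ can be sketched with `heapq` (a simplified illustration, not the course's `astar.py`; `neighbors`, `cost`, and `h` are hypothetical caller-supplied functions, and the counter avoids comparing nodes on $$f$$ ties):

```python
import heapq
import itertools

def astar(start, goal, neighbors, cost, h):
    """A* search; returns the node path from start to goal, or None."""
    counter = itertools.count()                       # tie-breaker for equal f
    open_heap = [(h(start), next(counter), start, 0.0, None)]
    explored = {}                                     # node -> parent
    g_best = {start: 0.0}
    while open_heap:
        _, _, s, g, parent = heapq.heappop(open_heap)
        if s in explored:                             # stale duplicate entry
            continue
        explored[s] = parent
        if s == goal:                                 # walk back through parents
            path = [s]
            while explored[path[-1]] is not None:
                path.append(explored[path[-1]])
            return path[::-1]
        for s2 in neighbors(s):
            g2 = g + cost(s, s2)
            if s2 not in g_best or g2 < g_best[s2]:
                g_best[s2] = g2
                heapq.heappush(open_heap, (g2 + h(s2), next(counter), s2, g2, s))
    return None
```

With an admissible `h` this pops the goal with its optimal cost; setting `h = lambda s: 0` reduces it to Dijkstra's algorithm.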
It is more efficient and the solution is $$\epsilon$$-optimal, $$c \le \epsilon c^*$$ ### Lazy A* Instead of checking edge collision for all neighbors, only check the edge to the parent when expanding. ==The OPEN list will contain multiple copies of a node since edges haven't been collision-checked yet.== ### Shortcut After a path is found, we can randomly pick two nodes and connect them directly if the edge is collision-free. ## Code run.py: main program 1. Load map info 2. Construct the graph given the environment, sampler, number of vertices, and connection radius. 3. Add start and end nodes 4. Use the A* or lazy A* algorithm to search for the optimal path and visualize it MapEnvironment.py: define utility functions associated with planning 1. Check edge collision 2. Compute heuristic and distance functions 3. Generate a path on the map 4. Visualize the graph and path Sampler.py: create random samples for graph construction 1. Use the Halton sequence to generate 2-D random vertices in $$(0,1)$$ 2. Scale by map info graph_maker.py: construct the graph 1. Use the Python package NetworkX to easily construct a graph 2. Add sampled valid vertices 3. Connect edges within the radius and without collision (do not check collision if using lazy A*) astar.py: A* algorithm 1. Use the heapq package to create a priority queue storing [f(s), count, node, g(s), parent] info. count is used to prevent comparing nodes when $$f(s)$$ values are equal. 2. Use a dict enqueued to store $$g(s)$$ and $$h(s)$$ for a node, and another dict explored to store each explored node and its parent node. 3. Add the start node to the queue. 4. While the queue is not empty, pop one node and add it to explored if it is not already there. 5. For each neighbor node $$s'$$ not already in explored, compute $$g(s') = g(s) + c(s,s')$$. 6. If $$s'$$ is in enqueued, get its previous $$g'(s')$$ and $$h'(s')$$. Continue to the next neighbor if $$g'(s') \le g(s')$$, since it is better. If not in enqueued, compute $$h(s')$$ with the heuristic function. 7. 
Update $$g(s')$$ and $$h(s')$$ in enqueued and push it to the queue. 8. When the target node is reached, follow parent nodes back through explored and return the path. lazy_astar.py: lazy A* algorithm 1. When expanding a node, check its edge collision with the parent. 2. When checking neighbors, there is no need to compare $$g'(s')$$ and $$g(s')$$ 3. Multiple copies of a node with different parents can be added to the queue. runDubins.py: similar to run.py but uses the Dubins environment DubinsMapEnvironment.py: inherits from MapEnvironment 1. Compute distances with Dubins paths 2. Compute the heuristic with Dubins paths 3. Generate paths by Dubins path planning DubinsSampler.py: 1. Sample $$N$$ vertices as in Sampler.py 2. Divide the angle range into $$M$$ parts and add an angle to each vertex to create $$M\times N$$ samples Dubins.py: utility functions for Dubins paths ROSPlanner.py: take the goal from rviz and plan. ==Be careful of the frame change between map and world== 1. Load map info 2. Construct the graph and save it for later use 3. Subscribe to the pose topic and save the current pose 4. Subscribe to the goal topic, plan from the current pose to the goal, and publish the path using the service defined in lab2 5. If there are multiple goals, plan through them sequentially and combine the paths # Reference ]]> <p><a href="https://courses.cs.washington.edu/courses/cse490r/19sp/">CSE 490R</a> Review</p> <p>My <a href="https://github.com/silencial/Mobile-Robots">solution</a> to the homework</p> ROS 101 https://silencial.github.io/ros-101/ 2020-03-01T00:00:00.000Z 2020-12-08T00:00:00.000Z ROS basic tutorial # ROS Basic ## Publisher & Subscriber ==A publisher can't publish immediately after it is created. 
It needs rospy.sleep(1) or a latched publisher.== ## Messages ### Create a msg Make a msg directory inside the ROS package and create a Test.msg file with the datatype content, for example: Add the following two lines to package.xml Change CMakeLists.txt as follows: It can be imported by from [package].msg import Test ## Service ### Create a srv Similar to msg, create Test.srv in srv/; change package.xml and CMakeLists.txt, the difference being: It can be imported by from [package].srv import Test ## Transform Frames are published on the tf topic. # Launch File & Parameter Parameters live in the node's name domain; for example, the above launch file will create a /node_name/arg1 parameter. In the Python file, arg1 can be read with ~, which refers to the current node's name domain. # Visualization ## Messages Visualize messages (values) using MarkerArray ## Rviz Running rviz without parameters loads the config in ~/.rviz/default.rviz. Saving an rviz config file and running rviz -d [config] launches Rviz with the specified config. This can be integrated into the launch file: # Command Line ## File System ## Package Use the Catkin Command Line Tools in place of the ROS built-in catkin tool. ## Run ## Running Info • Nodes are just executable files within a ROS package, which can publish/subscribe to Topics or provide/use Services. • Topics are for nodes to exchange messages. • Services are similar to topics but for RPC (remote procedure call) requests. • Messages are simple data structures. 
# Rosbag Filter a recorded rosbag by time: Export a certain topic from a rosbag to csv/txt: ]]> <p>ROS basic tutorial</p> Setup https://silencial.github.io/setup/ 2020-02-25T00:00:00.000Z 2021-05-13T00:00:00.000Z A collection of the tools and configurations I currently use. Other related blog posts: All my dotfiles. Work/slacking-off environment: currently a MBP 15" 2018 + Dell 2720Q 4K monitor; I often ssh into the lab's Ubuntu machine from the command line or VSCode. I used to have an Alienware 17" R4 dual-booting Windows & Ubuntu, with Ubuntu as the main system and Windows only for Steam; I sold it because I was moving and the MBP felt wasted sitting unused. No plans for a desktop for now; gaming is left entirely to the Xbox # Yabai A tiling window manager for macOS. Together with the skhd hotkey daemon and the spacebar status bar it completely transforms the desktop: quickly switch and arrange windows and spaces, and say goodbye to the mouse and trackpad. Read the wiki in full before installing and using it. Some system settings (all in System Preferences): • Prevent spaces from reordering automatically: uncheck Mission Control -> Automatically rearrange Spaces based on most recent use • Reduce animations: check Accessibility -> Reduce Motion • Hide the Dock: check Dock -> Automatically hide and show the Dock • Hide the menu bar: check General -> Automatically hide and show the menu bar ## Spacebar A lightweight status bar used alongside Yabai to show the current space and window info. Installation and configuration are simple; mind the font when displaying icons ## Skhd A simple hotkey daemon that triggers commands from shortcuts, used with Yabai. The config-file rules are simple; combined with the Karabiner key-mapping tool it enables further customization # Karabiner A keyboard-remapping tool for macOS. Add a ~/.config/karabiner/assets/complex_modifications/custom.json file and write rules in it. My config is mainly based on Capslock: • Capslock alone acts as Esc; combined with other keys it becomes Hyper • Combined with h, j, k, l it becomes the arrow keys # Alfred A powerful replacement for Mac Spotlight: open apps, files, and web pages; search bookmarks; view the clipboard. Workflows can be added to extend it. Recommended workflow: Automatic settings backup (contains sensitive information): Preferences -> Advanced -> Set preferences folder # Pock Shows the Dock on the Touch Bar # Lunar Automatically adjusts the brightness and contrast of external monitors using the MacBook's built-in light sensor. Nice interface, many modes; some parameters need tuning to your needs # Rime A cross-platform open-source input method, known as Squirrel on macOS. Highly customizable but complex to pick up; see the help documentation. All data stays local, and settings and dictionaries can be synced across platforms. # Typora Markdown editor, used for everyday notes and blog writing. Typora is the best 🐶 Recommended themes: Ursine Polar, Cobalt, Hivacruz # Zotero Knowledge-management platform, mainly used to organize papers, synced via Dropbox. Zotero settings under Preferences -> Advanced -> Files and Folders 1. Linked Attachment Base Directory: choose the attachment root directory .../Dropbox/Zotero 2. Data Directory Location: choose the link root directory .../Zotero/storage Also uncheck the File Syncing option in Preferences -> Sync. Install the Zotero plugin ZotFile and change in Tools -> ZotFile Preferences 1. Source Folder for Attaching New Files: choose the link root directory 2. Custom Location: choose the attachment root directory 3. 
Use subfolder defined by: create subfolders by collection with /%c Select all Zotero items and right-click Manage Attachments -> Rename Attachments to rename the PDF attachments and move them into the configured folder. The renaming rules can be changed in ZotFile Preferences -> Renaming Rules ==This approach cannot delete the corresponding PDF when an item is deleted; zot_rm_unmaintained_files.py can delete them automatically== # Mathpix Snip Converts screenshots of math formulas into Mathjax syntax with high accuracy; very handy when writing study notes in Typora. Free accounts get 50 snips/month # Surfingkeys Browser extension for keyboard-driven browsing with Vim-like keybindings. Easy to pick up, full-featured, customizable # VSCode Code editor with a complete extension ecosystem and fast iteration — another best-in-the-world 🐶 Some recommended extensions • Ayu: theme • Dumb copy-paste: paste with indentation preserved via ctrl+shift+v • GitLens: show git info inside files • Material Theme: theme, mainly used for its file icons • Path Intellisense: autocomplete file paths • Rainbow CSV: highlight csv files • Sublime Text Keymap and Settings Importer: painless migration of Sublime shortcuts and settings • Todo Tree: highlight keywords ## Python Autoformatting 1. Install the Python linter flake8 and the formatter yapf 2. Configure them in VSCode # Kite AI-powered code-completion plugin; fun to use # Jupyter Notebook Interactive computing web app, mainly used to test Python code ## Themes 1. Change the theme; I currently use jt -t onedork -f consolamono -fs 13 -tfs 15 -nfs 15 -ofs 13 -cellw 88% -T 2. Fix matplotlib compatibility with the theme: add the snippet to ~/.ipython/profile_default/startup/startup.ipy To reset figure colors, e.g. when saving figures, add jtplot.reset() in the notebook. ## Extensions Recommended extensions: • LaTeX environments for Jupyter: convenient LaTeX writing. • ScrollDown: auto-scroll the output window. • Table of Contents: show a Markdown outline view. • Variable Inspector: inspect variable types and dimensions. • Hinterland: code autocompletion. • Spellchecker: check Markdown spelling. • Scratchpad: a small debugging window. # iTerm2 Terminal for macOS. Some settings: • Default terminal: iTerm2 -> Make iTerm2 Default Term • Hotkey: iTerm2 -> Preferences -> Keys -> Hotkey, set to option + space • Fonts: install Nerd Fonts and pick one in iTerm2 -> Preferences -> Profiles -> Text. Recommended: 14 pt MesloLGM Nerd Font Regular • Download iTerm2 color schemes. Recommended: Snazzy, ayu Backup: • Automatic settings backup: Preferences -> General -> Preferences -> Load preferences from a custom folder or URL • Manual profile backup: Preferences -> Profiles -> Other Actions -> Save all Profiles as JSON # Oh My Zsh zsh enhancement. Enable some bundled plugins: • git: various git aliases • extract: extract any archive with x • colored-man-pages: highlighted man pages ## Powerlevel10k A zsh theme with many customization options and a foolproof guided setup ## zsh-autosuggestions zsh plugin that suggests commands from history, completed with one key ## zsh-syntax-highlighting zsh plugin that highlights commands. Note that it must come last in the 
.zshrc plugin list # Vim Command-line editor; Vim has fully replaced Sublime for small files, while large projects go to VSCode. Newcomers can run the vimtutor command to get familiar quickly; the learning curve is steep. Install the vim-plug plugin manager; plugins can be searched on VimAwesome # Tmux Terminal multiplexer; sessions survive ssh disconnects. Each session can hold multiple windows, and each window can be split into panes. Actions in a session are visible on both the server and client side, which makes it useful for demos. Install the tpm plugin manager ## Dracula Tmux A tmux theme plugin showing simple status info, easy to customize ## vim+tmux+true_color+italic 1. Add the settings to .tmux.conf 2. Download tmux-256color and run /usr/bin/tic -x tmux-256color in the terminal, which generates the ~/.terminfo file # Ranger Command-line file manager with Vim-like keybindings. Supports custom shortcuts and commands. Newcomers should read the wiki ==Bug==: the quick preview sometimes hangs and needs a ctrl-c to continue. See the issue # Yadm Yet Another Dotfiles Manager. A dotfiles management tool. It basically keeps a git repository in the ~ directory; the commands are the same as git but the mode is the opposite: no files are tracked by default, they must be added manually with yadm add # Command-Line Tools • autojump: fast directory jumping • bat: enhanced cat with highlighted output for many syntaxes • cheat: usage hints for commands, with support for writing your own • icdiff: diff tool with highlighting, works well with git • lsd: enhanced ls • neofetch: show system info, highly customizable • ripgrep: grep-like regex search over file contents • tree: show directory structure as a tree • htop: interactive process manager • wudao-dict: command-line version of the Youdao dictionary • nvidia-htop: enhanced nvidia-smi showing each process's user and CPU/memory usage • xclip (Linux): copy file contents to the clipboard; Mac ships with pbcopy # Personalization • Downlink: live satellite imagery as the desktop wallpaper • Matrix: a Matrix-style screensaver # Hexo Blog setup; see the Readme # System ## Win & Ubuntu Dual Boot ### Installing Ubuntu from Win In the power options of the Win control panel, disable fast startup. Use the disk-management tool in Win to shrink out about 100 GB of free space. ==If the drive is a dynamic disk, first convert it to a basic disk with AOMEI Partition Assistant.== Download the installer from the Ubuntu website and create a bootable USB drive following the official tutorial. Plug the USB drive into the computer, reboot into the BIOS, and boot from USB. Follow the installation steps, choosing to install alongside Win on the installation-type page. After installation, install the Nvidia graphics driver under Software -> Additional Drivers; if Secure Boot was not disabled, set a password and reboot, then choose Enroll Key in the menu and enter the password. Most tutorials online still install in Legacy mode: disable Secure Boot, partition manually, and build the bootloader with extra software. With current versions none of that is necessary ### Installing Win from Ubuntu A Win machine is needed to create the installer USB drive. Download the installation tool from the Microsoft website ==(note it must be opened on a Win machine for the media-creation tool download to appear; otherwise it redirects to the ISO download)==. Create the USB drive following the instructions. Plug it into the computer, reboot into the BIOS, boot from USB, and follow the installation steps. If it complains that installation to the disk is impossible because of an MBR partition table, press Shift + F10 to open a command line and run the following commands to convert the disk to GPT, then continue the installation: Adding the Ubuntu boot entry: after installation, set Ubuntu as the first boot option in the BIOS; under Ubuntu change GRUB_DEFAULT=4 in /etc/default/grub and run sudo update-grub; seeing Found Windows ... 
in the output confirms success ## WOL ==First check whether the motherboard supports WOL== Enable the WOL service: • Ubuntu: enable WOL (Wake on LAN) in the BIOS. Install ethtool with sudo apt-get install ethtool and enable WOL on the network card with sudo ethtool -s eth0 wol g • Mac: enable Energy Saver -> Power Adapter -> Wake for Wi-Fi network access Send a Magic Packet to wake another machine: • Install wakeonlan and run wakeonlan -i ip -p port macaddress # Ubuntu Fix ## Mouse Wheel Speed Download imwheel, create the ~/.imwheelrc file with the following content, and add it to startup: run gnome-session-properties and add imwheel -k -b "4 5" ## No Audio over DP Change the monitor option to DP 1.1 so that an HDMI/DisplayPort option appears under Settings -> Sound -> Output. The profile may also need to be changed in pavucontrol -> Configuration ## Adding an App to the Launcher Add a new xxx.desktop file under /usr/share/applications with the following content ## Bluetooth Auto-Connect From the terminal: # Tips ## SSH With Tmux Use ssh NAME -t tmux new -A to automatically attach to a tmux session on ssh, creating one if none exists ## Timestamps in History Add HIST_STAMPS="yyyy-mm-dd" to .zshrc ## Mac System Copy pbcopy can only copy text content; to get the effect of a system copy, add the following function to .zshrc and copy a file with syscp FILE. Note that it can only copy a single file ]]> <p><img src="https://i.imgur.com/3XR2qVc.jpg" alt="Demo" /></p> <p>A collection of the tools and configurations I currently use. Other related blog posts:</p> <ul> <li><a href="https://www.notion.so/silencial/6dedf8a81bf840528e2b442477a3928c">Common Software</a>: my Notion homepage</li> <li><a href="https://silencial.github.io/cs-tools/">CS Tools</a>: notes on common CS tools</li> </ul> Deep Reinforcement Learning (Part 3) https://silencial.github.io/deep-reinforcement-learning-3/ 2020-02-07T00:00:00.000Z 2020-02-13T00:00:00.000Z Berkeley CS 285 Review My solution to the homework Deep Reinforcement Learning (Part 1) Deep Reinforcement Learning (Part 2) # Variational Inference Implicit latent variable models can be used to model complicated distributions: $p(x) = \int p(x|z)p(z)dz$ and the maximum likelihood fit becomes $\DeclareMathOperator*{\argmin}{\arg\min}\DeclareMathOperator*{\argmax}{\arg\max}\theta \leftarrow \argmax _{ \theta } \frac { 1 } { N } \sum_ { i } \log \left( \int p _{ \theta } \left( x_ { i } | z \right) p ( z ) d z \right)$ the integral is intractable so we replace it by an expectation $\begin{equation}\theta \leftarrow \argmax _{ \theta } \frac { 1 } { N } \sum_ { i } 
\mathbb{E}_{z \sim p(z | x_i)} \left[ \log p _ { \theta } \left( x _ { i } | z \right) \right]\label{maxlikely}\end{equation}$ We use another distribution $$q(z)$$ to approximate $$p(z | x_i)$$, with the KL divergence as the metric: \begin{aligned}D_{\mathrm{KL}}(q_i(z) \| p(z | x_i)) &= -\mathbb{E}_{z \sim q_{i}(z)}\left[\log p\left(x_{i} | z\right)+\log p(z)\right]+ \mathbb{E}_{z \sim q_{i}(z)}\left[\log q_{i}(z)\right] + \log p(x_i) \\&= - \mathcal{L}_i(p, q_i) + \log p(x_i)\end{aligned} $$\mathcal{L}_i (p, q_i)$$ is called the evidence lower bound (ELBO), since $$\log p(x_i) \ge \mathcal{L}_i(p, q_i)$$. Maximizing the ELBO w.r.t. $$q_i$$ minimizes the KL divergence. So now $$\eqref{maxlikely}$$ becomes $\theta \leftarrow \argmax _{ \theta } \frac { 1 } { N } \sum_ { i } \mathcal{L}_i(p, q_i)$ $$\mathcal{L}_{i}\left(p, q_{i}\right)=\mathbb{E}_{z \sim q_{i}(z)}\left[\log p\left(x_{i} | z\right)+\log p(z)\right]+\mathcal{H}(q_i)$$. The objective balances reconstruction (the expectation term) against the entropy of $$q_i$$. Update scheme: 1. sample $$z \sim q_i(z)$$ 2. calculate $$\nabla_\theta \mathcal{L}_i(p, q_i) \approx \nabla_\theta \log p_\theta(x_i | z)$$ 3. $$\theta \leftarrow \theta + \alpha \nabla_\theta \mathcal{L}_i(p, q_i)$$ 4. update $$q_i$$ to maximize $$\mathcal{L}_i(p, q_i)$$ If $$q_i(z) = \mathcal{N}(\mu_i, \sigma_i)$$, step 4 becomes gradient ascent on $$\mu_i, \sigma_i$$ Problem: the parameter size $$|\theta|+\left(\left|\mu_{i}\right|+\left|\sigma_{i}\right|\right) \times N$$ is too large ## Amortized Variational Inference If we can learn a network $$q_\phi(z | x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi(x))$$, step 4 will become $$\phi \leftarrow \phi + \nabla_\phi \mathcal{L}_i(p_\theta, q_\phi)$$. The parameter size problem is solved. 
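As a sanity check on the bound $$\log p(x_i) \ge \mathcal{L}_i(p, q_i)$$, the ELBO can be estimated by Monte Carlo for a toy conjugate model where everything is known in closed form: $$p(z)=\mathcal{N}(0,1)$$, $$p(x|z)=\mathcal{N}(z,1)$$, hence $$p(x)=\mathcal{N}(0,2)$$ and the true posterior is $$\mathcal{N}(x/2, 1/2)$$ (an illustrative sketch, not from the course material):

```python
import math
import random

def log_normal(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def elbo(x, m, s2, n=200_000, seed=0):
    """Monte Carlo ELBO  E_q[log p(x|z) + log p(z)] + H(q)
    for p(z)=N(0,1), p(x|z)=N(z,1), q(z)=N(m, s2)."""
    rng = random.Random(seed)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)  # Gaussian entropy
    total = 0.0
    for _ in range(n):
        z = m + math.sqrt(s2) * rng.gauss(0, 1)
        total += log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    return total / n + entropy

x = 1.0
log_px = log_normal(x, 0.0, 2.0)   # exact evidence: x ~ N(0, 2)
loose = elbo(x, 0.0, 1.0)          # arbitrary q: strictly below log p(x)
tight = elbo(x, x / 2, 0.5)        # q = true posterior: gap (the KL) is zero
```

`loose` sits below `log_px` by exactly the KL divergence between its `q` and the posterior, while `tight` recovers `log_px` up to Monte Carlo noise.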
Expand $$\mathcal{L}_i$$: \begin{aligned}\mathcal { L }_ { i } &= \mathbb{E} _{ z \sim q_ { \phi } ( z | x _{ i } ) } \left[ \log p _ { \theta } \left( x _ { i } | z \right) + \log p ( z ) \right] + \mathcal { H } \left( q_ { \phi } ( z | x _{ i } ) \right) \\&= \mathbb{E}_{ z \sim q _{ \phi } ( z | x_ { i } ) } [r(x_i, z)] + \mathcal { H } \left( q _{ \phi } ( z | x_ { i } ) \right) \\&= J(\phi) + \mathcal { H } \left( q _{ \phi } ( z | x_ { i } ) \right)\end{aligned} The second term is the entropy of a Gaussian. For the first term, we can use policy gradient: $\nabla J(\phi) \approx \frac{1}{M} \sum_j \nabla_\phi \log q _{ \phi } ( z_j | x _{ i } ) r(x_i, z_j)$ Another way to do it is to use reparameterization trick: \begin{aligned}J ( \phi ) & = \mathbb{E}_ { z \sim q _{ \phi } \left( z | x_ { i } \right) } \left[ r \left( x _ { i } , z \right) \right] \\& = \mathbb{E} _{ \epsilon \sim \mathcal { N } ( 0,1 ) } \left[ r \left( x _ { i } , \mu _ { \phi } \left( x _ { i } \right) + \epsilon \sigma _ { \phi } \left( x _ { i } \right) \right) \right]\end{aligned} so that $$\nabla J(\phi)$$ can be computed by samples $$\epsilon_j$$ from $$\mathcal{N}(0, 1)$$: $J ( \phi ) \approx \frac{1}{M} \sum_j \nabla_\phi r \left( x _{ i } , \mu_ { \phi } \left( x _{ i } \right) + \epsilon_j \sigma _{ \phi } \left( x_ { i } \right) \right)$ ## Variational Autoencoders Encoder: $$q_\phi(z | x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi(x))$$ Decoder: $$p_\theta(x | z) = \mathcal{N}(\mu_\theta(z), \sigma_\theta(z))$$ Algorithm: The first term maximize the likelihood from the decoder, the second term makes the encoder and the prior stay close. 
$\max _{\theta, \phi} \frac{1}{N} \sum_{i} \log p_{\theta}\left(x_{i} | \mu_{\phi}\left(x_{i}\right)+\epsilon \sigma_{\phi}\left(x_{i}\right)\right)-D_{\mathrm{KL}}\left(q_{\phi}\left(z | x_{i}\right) \| p(z)\right)$ # Control as Inference Problem ## Graphical Model Deterministic model of decision making: $\mathbf{a}_1, \dots, \mathbf{a}_T = \argmax_{\mathbf{a}_1, \dots, \mathbf{a}_T} \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \\\mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t)$ which has one optimal solution. In order to explain stochastic behavior, we need a probabilistic graphical model: $\begin{equation}p(\mathcal{O}_t | \mathbf{s}_t, \mathbf{a}_t) = \exp(r(\mathbf{s}_t, \mathbf{a}_t)) \\p \left( \tau | \mathcal { O } _{ 1 : T } \right) = \frac { p \left( \tau , \mathcal { O }_ { 1 : T } \right) } { p \left( \mathcal { O } _{ 1 : T } \right) } \propto p ( \tau ) \exp \left( \sum_ { t } r \left( \mathbf { s } _{ t } , \mathbf { a }_ { t } \right) \right)\label{pgm}\end{equation}$ where $$\mathcal{O}$$ represent the binary optimality. Pros: • Can model suboptimal behavior (important for inverse RL). • Can apply inference algorithms to solve control and planning problems. • Provides an explanation for why stochastic behavior might be preferred (useful for exploration and transfer learning). Steps to do Inference: 1. compute backward messages $$\beta_t(\mathbf{s}_t, \mathbf{a}_t) = p(\mathcal{O}_{t:T} | \mathbf{s}_t, \mathbf{a}_t)$$ 2. compute policy $$p(\mathbf{a}_T | \mathbf{s}_T, \mathcal{O}_{1:T})$$ 3. 
compute forward messages $$\alpha_t(\mathbf{s}_t) = p(\mathbf{s}_t | \mathcal{O}_{1:t-1})$$ ## Backward Message \begin{aligned}\beta _ { t } \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) & = p \left( \mathcal { O } _ { t : T } | \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) \\& = \int p \left( \mathcal { O } _ { t : T } , \mathbf { s } _ { t + 1 } | \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) d \mathbf { s } _ { t + 1 } \\& = \int p \left( \mathcal { O } _ { t + 1 : T } | \mathbf { s } _ { t + 1 } \right) p \left( \mathbf { s } _ { t + 1 } | \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) p \left( \mathcal { O } _ { t } | \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) d \mathbf { s } _ { t + 1 }\end{aligned} The second and third term are known, the first term can be expressed as \begin{aligned}\beta _{ t } \left( \mathbf { s }_ { t+1 } \right) &= p \left( \mathcal { O } _{ t + 1 : T } | \mathbf { s }_ { t + 1 } \right) \\&= \int p \left( \mathcal { O } _{ t + 1 : T } | \mathbf { s }_ { t + 1 }, \mathbf { a } _{ t + 1 } \right) p(\mathbf { a }_ { t + 1 } | \mathbf { s } _{ t + 1 }) d \mathbf { a }_ { t + 1 } \\&= \int \beta _{ t } \left( \mathbf { s }_ { t+1 } , \mathbf { a } _{ t+1 } \right) p(\mathbf { a }_ { t + 1 } | \mathbf { s } _{ t + 1 }) d \mathbf { a }_ { t + 1 }\end{aligned} Without loss of generality, assume $$p(\mathbf { a } _{ t + 1 } | \mathbf { s }_ { t + 1 })$$ is uniform (Proof). Then the algorithm is 1. For $$t=T-1$$ to $$1$$ 2. $$\beta _{ t } \left( \mathbf { s }_ { t } , \mathbf { a } _{ t } \right) = p \left( \mathcal { O }_ { t } | \mathbf { s } _{ t } , \mathbf { a }_ { t } \right) \mathbb{E} _{ \mathbf { s }_ { t + 1 } \sim p \left( \mathbf { s } _{ t + 1 } | \mathbf { s }_ { t } , \mathbf { a } _ { t } \right) } \left[ \beta _ { t + 1 } \left( \mathbf { s } _ { t + 1 } \right) \right]$$ 3. 
$$\beta _{ t } \left( \mathbf { s }_ { t } \right) = E _{ \mathbf { a }_ { t } \sim p \left( \mathbf { a } _{ t } | \mathbf { s }_ { t } \right) } \left[ \beta _ { t } \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) \right]$$ Relationship to value iteration algorithm: Let $$V(\mathbf{s}_t) = \log \beta_t(\mathbf{s}_t)$$ and $$Q(\mathbf { s }_ { t } , \mathbf { a } _{ t }) = \log \beta_t(\mathbf { s } _{ t } , \mathbf { a }_ { t })$$, then the backward pass can be written as $\begin{equation}Q(\mathbf { s } _{ t } , \mathbf { a }_ { t }) = r(\mathbf { s } _{ t } , \mathbf { a }_ { t }) + \log \mathbb{E} \big[ \exp(V(\mathbf { s } _ { t+1})) \big] \\V(\mathbf{s}_t) = \log \int \exp(Q(\mathbf { s }_ { t } , \mathbf { a } _{ t })) d \mathbf{a}_t\label{backmessage}\end{equation}$ Note that when $$Q$$ gets bigger, $$V \rightarrow \max_{\mathbf{a}_t} Q$$ ## Compute Policy \begin{aligned}p(\mathbf{a}_t | \mathbf{s}_t,\mathcal{O}_{1:T}) &= \pi (\mathbf{a}_t | \mathbf{s}_t) \\&= p(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}_{t:T}) \\&= \frac{p(\mathcal{O}_{t:T} | \mathbf{s}_t, \mathbf{a}_t) p(\mathbf{s}_t, \mathbf{a}_t)}{p(\mathcal{O}_{t:T} | \mathbf{s}_t) p(\mathbf{s}_t)} \\&= \frac{\beta_t(\mathbf{s}_t, \mathbf{a}_t)}{\beta_t(\mathbf{s}_t)} p(\mathbf{a}_t | \mathbf{s}_t)\end{aligned} Ignore the action prior we can get that $$\pi (\mathbf{a}_t | \mathbf{s}_t) = \beta_t(\mathbf{s}_t, \mathbf{a}_t) / \beta_t(\mathbf{s}_t)$$. 
Plug in $$Q$$, $$V$$ we can see $$\pi \left( \mathbf { a }_ { t } | \mathbf { s } _{ t } \right) = \exp \left( Q \left( \mathbf { s }_ { t } , \mathbf { a } _{ t } \right) - V \left( \mathbf { s }_ { t } \right) \right) = \exp \left( A \left( \mathbf { s } _{ t } , \mathbf { a }_ { t } \right) \right)$$ ## Forward Message \begin{aligned}\alpha_t(\mathbf{s}_t) &= p(\mathbf{s}_t | \mathcal{O}_{1:t-1}) \\&= \int p(\mathbf{s}_t, \mathbf{s}_{t-1}, \mathbf{a}_{t-1} | \mathcal{O}_{1:t-1}) d\mathbf{s}_{t-1} d\mathbf{a}_{t-1} \\&= \int p ( \mathbf { s } _ { t } | \mathbf { s } _ { t - 1 } , \mathbf { a } _ { t - 1 } ) p ( \mathbf { a } _ { t - 1 } | \mathbf { s } _ { t - 1 } , \mathcal { O } _ { t - 1 } ) p ( \mathbf { s } _ { t - 1 } | \mathcal { O } _ { 1 : t - 1 } ) d \mathbf { s } _ { t - 1 } d \mathbf { a } _ { t - 1 }\end{aligned} The first term is the dynamics, the rest can be further simplified: \begin{aligned}p ( \mathbf { a } _{ t - 1 } | \mathbf { s }_ { t - 1 } , \mathcal { O } _{ t - 1 } ) p ( \mathbf { s }_ { t - 1 } | \mathcal { O } _{ 1 : t - 1 } ) &= \frac{p ( \mathcal { O }_ { t - 1 } | \mathbf { s } _{ t - 1 } , \mathbf { a }_ { t - 1 } ) p ( \mathbf { a } _{ t - 1 } | \mathbf { s }_ { t - 1 } )}{p ( \mathcal { O } _{ t - 1 } | \mathbf { s }_ { t - 1 } )} \frac{ p ( \mathcal { O } _{ t - 1 } | \mathbf { s }_ { t - 1 } ) p ( \mathbf { s } _{ t - 1 } | \mathcal { O }_ { 1 : t - 2 } )}{p(\mathcal { O } _{ t - 1 } | \mathcal { O }_ { 1 : t - 2 })} \\&= \frac{p ( \mathcal { O } _{ t - 1 } | \mathbf { s }_ { t - 1 } , \mathbf { a } _{ t - 1 } ) p ( \mathbf { a }_ { t - 1 } | \mathbf { s } _{ t - 1 } )}{p(\mathcal { O }_ { t - 1 } | \mathcal { O } _{ 1 : t - 2 })} \alpha_{t-1}(\mathbf{s}_{t-1})\end{aligned} If we want $$p(\mathbf{s}_t | \mathcal{O}_{1:T})$$: $p(\mathbf{s}_t | \mathcal{O}_{1:T}) = \frac{p(\mathcal{O}_{t:T} | \mathbf{s}_t) p(\mathbf{s}_t \mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})} \propto \beta_t(\mathbf{s}_t) \alpha_t(\mathbf{s}_t)$ ## Optimism 
Problem In the backward pass $$\eqref{backmessage}$$ is optimistic about the $$Q$$ since $$\log\mathbb{E} \big[ \exp(V_{t+1}(\mathbf { s } _ { t+1})) \big]$$ is like the maximum. This happens because the dynamics from the inference problem is not real dynamics. Marginalizing the inference problem $$p(\mathbf{s}_{1:T}, \mathbf{a}_{1:T} | \mathcal{O}_{1:T})$$ to get • policy: $$p(\mathbf{a}_{t} | \mathbf{s}_{t}, \mathcal{O}_{1:T})$$ (given high reward obtained, what was the action probability? This is good.) • dynamics: $$p(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}, \mathcal{O}_{1:T}) \ne p(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t})$$ (given high reward obtained, what was the transition probability? This is bad.) The solution is to find a distribution $$q(\mathbf{s}_{1:T}, \mathbf{a}_{1:T})$$ that is close to $$p(\mathbf{s}_{1:T}, \mathbf{a}_{1:T} | \mathcal{O}_{1:T})$$ but has dynamics $$p(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t})$$. Let $$\mathbf{x} = \mathcal{O}_{1:T}$$ and $$\mathbf{z} = (\mathbf{s}_{1:T}, \mathbf{a}_{1:T})$$, $$q(\mathbf{z})$$ can be built by ==(same dynamics and initial state as $$p$$)== $q(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}) = p(\mathbf{s}_{1}) \prod_t p(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}) q(\mathbf{a}_{t} | \mathbf{s}_{t})$ Applying the ELBO and solve for $$q(\mathbf{a}_{t} | \mathbf{s}_{t})$$ by dynamic programming we can get (Proof) $\begin{equation}Q(\mathbf { s } _{ t } , \mathbf { a }_ { t }) = r(\mathbf { s } _{ t } , \mathbf { a }_ { t }) + \mathbb{E} \big[ V(\mathbf { s } _ { t+1}) \big] \\V(\mathbf{s}_t) = \log \int \exp(Q(\mathbf { s }_ { t } , \mathbf { a } _{ t })) d \mathbf{a}_t \\q ( \mathbf{a}_t | \mathbf{s}_t) = \exp(Q(\mathbf{s}_t, \mathbf{a}_t) - V(\mathbf{s}_t))\label{softoptimal}\end{equation}$ which corresponds to a standard Bellman backup with a soft maximization for the value function. ## Soft Optimality Value iteration algorithm with soft optimality: 1. 
set $$Q(\mathbf { s }, \mathbf { a }) \leftarrow r(\mathbf { s }, \mathbf { a }) + \gamma\mathbb{E} \big[ V(\mathbf { s' }) \big]$$ 2. set $$V(\mathbf{s}) \leftarrow \text{softmax}_\mathbf{a} Q(\mathbf { s }, \mathbf { a })$$ $$Q$$-learning with soft optimality: compute $$y_j = r_j + \gamma \operatorname{softmax}_{\mathbf{a}_j'} Q_{\phi'}(\mathbf{s}_j', \mathbf{a}_j')$$ Policy gradient with soft optimality: $$J ( \theta ) = \sum _{ t } \mathbb{E}_ { \pi \left( \mathbf { s } _{ t } , \mathbf { a }_ { t } \right) } \left[ r \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) \right] + \mathbb{E} _{ \pi \left( \mathbf { s }_ { t } \right) } [ \mathcal { H } \left( \pi \left( \mathbf { a } | \mathbf { s } _ { t } \right) \right) ] = \sum _{ t } \mathbb{E}_ { \pi \left( \mathbf { s } _{ t } , \mathbf { a }_ { t } \right) } \left[ r \left( \mathbf { s } _ { t } , \mathbf { a } _ { t } \right) - \log \pi \left( \mathbf { a } _ { t } | \mathbf { s } _ { t } \right) \right]$$ Pros of soft optimality: • Improve exploration and prevent entropy collapse • Easier to specialize (fine tune) policies for more specific tasks • Principled approach to break ties • Better robustness (due to wider coverage of states) • Can reduce to hard optimality as reward magnitude increases • Good model for modeling human behavior # Inverse RL Learn the reward function from observing an expert, and then use RL. Forward RL: Given states $$\mathbf{s} \in \mathcal{S}$$, actions $$\mathbf{a} \in \mathcal{A}$$, transitions $$p(\mathbf{s}^{\prime} | \mathbf{s}, \mathbf{a})$$ (sometimes), reward function $$r(\mathbf{s}, \mathbf{a})$$. Learn $$\pi^{\star}(\mathbf{a} | \mathbf{s})$$. Inverse RL: Given states $$\mathbf{s} \in \mathcal{S}$$, actions $$\mathbf{a} \in \mathcal{A}$$, transitions $$p(\mathbf{s}^{\prime} | \mathbf{s}, \mathbf{a})$$ (sometimes), samples $$\{ \tau_i \}$$ from $$\pi^{\star}(\tau)$$. Learn reward function $$r_\psi(\mathbf{s}, \mathbf{a})$$. 
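The soft value-iteration backup $$\eqref{softoptimal}$$ from the previous section can be sketched for a tiny tabular MDP, with logsumexp playing the role of the soft maximum (a toy illustration; `R[s][a]` and `P[s][a][s']` are hypothetical reward and transition tables):

```python
import math

def soft_value_iteration(R, P, gamma=0.9, iters=200):
    """Tabular soft value iteration:
    Q[s][a] = R[s][a] + gamma * E_{s' ~ P}[V[s']],  V[s] = logsumexp_a Q[s][a]."""
    nS, nA = len(R), len(R[0])
    V = [0.0] * nS
    Q = [[0.0] * nA for _ in range(nS)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS))
              for a in range(nA)] for s in range(nS)]
        V = [math.log(sum(math.exp(q) for q in Q[s])) for s in range(nS)]
    # soft-optimal policy: pi(a|s) = exp(Q(s,a) - V(s))
    pi = [[math.exp(Q[s][a] - V[s]) for a in range(nA)] for s in range(nS)]
    return V, Q, pi
```

As the rewards are scaled up, the logsumexp approaches a hard max and the policy concentrates on the greedy action, recovering hard optimality.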
## Feature Matching IRL Assume linear reward function: $$r_{\psi}(\mathbf{s}, \mathbf{a})=\sum_{i} \psi_{i} f_{i}(\mathbf{s}, \mathbf{a})=\psi^{T} \mathbf{f}(\mathbf{s}, \mathbf{a})$$ So we can choose $$\psi$$ s.t. $$\mathbb{E}_{\pi^{r_\psi}}[\mathbf{f}(\mathbf{s}, \mathbf{a})] = \mathbb{E}_{\pi^{*}}[\mathbf{f}(\mathbf{s}, \mathbf{a})]$$ to match the expectation of the important features $$\mathbf{f}$$. Since there could be multiple reward functions, we choose the one with the maximum margin: $\max _{\psi, m} m \\\text {s.t.} \quad \psi^{T} \mathbb{E}_{\pi^{*}}[\mathbf{f}(\mathbf{s}, \mathbf{a})] \geq \max _{\pi \in \Pi} \psi^{T} \mathbb{E}_{\pi}[\mathbf{f}(\mathbf{s}, \mathbf{a})]+m$ using the SVM trick, the optimization problem becomes: $\min _{\psi} \frac{1}{2}\|\psi\|^{2} \\\text {s.t.} \quad \psi^{T} \mathbb{E}_{\pi^{*}}[\mathbf{f}(\mathbf{s}, \mathbf{a})] \geq \max _{\pi \in \Pi} \psi^{T} \mathbb{E}_{\pi}[\mathbf{f}(\mathbf{s}, \mathbf{a})]+D\left(\pi, \pi^{\star}\right)$ where $$D\left(\pi, \pi^{\star}\right)$$ represent the difference between policies. 
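The feature-expectation matching condition $$\mathbb{E}_{\pi^{r_\psi}}[\mathbf{f}(\mathbf{s}, \mathbf{a})] = \mathbb{E}_{\pi^{*}}[\mathbf{f}(\mathbf{s}, \mathbf{a})]$$ can be checked empirically from sampled trajectories (a minimal sketch; the feature function `f` and the trajectory format, lists of $$(s, a)$$ pairs, are assumptions for illustration):

```python
def feature_expectation(trajs, f):
    """Empirical E[f(s, a)]: average the feature vector over every (s, a)
    pair appearing in the sampled trajectories."""
    feats = [f(s, a) for tau in trajs for (s, a) in tau]
    return [sum(v[i] for v in feats) / len(feats) for i in range(len(feats[0]))]

def matching_residual(expert_trajs, policy_trajs, f):
    """E_expert[f] - E_policy[f]; a zero vector means the feature
    expectations match."""
    mu_e = feature_expectation(expert_trajs, f)
    mu_p = feature_expectation(policy_trajs, f)
    return [e - p for e, p in zip(mu_e, mu_p)]
```

For a linear reward $$r_\psi = \psi^T \mathbf{f}$$, driving this residual to zero is exactly what the constraint above demands.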
## MaxEnt IRL From $$\eqref{pgm}$$ we can get another way to learn the reward: $\max _{\psi} \frac{1}{N} \sum_{i=1}^{N} \log p\left(\tau_{i} | \mathcal{O}_{1: T}, \psi\right)=\max _{\psi} \frac{1}{N} \sum_{i=1}^{N} r_{\psi}\left(\tau_{i}\right)-\log Z \\Z = \int p(\tau) \exp( r_\psi(\tau)) d\tau$ Taking the derivative we can get \begin{aligned}\nabla_{\psi} \mathcal{L} &= \frac{1}{N} \sum_{i=1}^{N} \nabla_{\psi} r_{\psi}\left(\tau_{i}\right)-\frac{1}{Z} \int p(\tau) \exp \left(r_{\psi}(\tau)\right) \nabla_{\psi} r_{\psi}(\tau) d \tau \\&= \mathbb{E}_{\tau \sim \pi^{\star}(\tau)}\left[\nabla_{\psi} r_{\psi}\left(\tau_{i}\right)\right] - \mathbb{E}_{\tau \sim p\left(\tau | \mathcal{O}_{1: T}, \psi\right)}\left[\nabla_{\psi} r_{\psi}(\tau)\right]\end{aligned} The first term can be computed by sampling from the expert policy, the second term needs more work \begin{aligned}\mathbb{E}_{\tau \sim p\left(\tau | \mathcal{O}_{1: T}, \psi\right)}\left[\nabla_{\psi} r_{\psi}(\tau)\right] &= \mathbb{E}_{\tau \sim p\left(\tau | \mathcal{O}_{1: T}, \psi\right)}\left[\nabla_{\psi} \sum_{t=1}^{T} r_{\psi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \\&= \sum_{t=1}^{T} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim p\left(\mathbf{s}_{t}, \mathbf{a}_{t} | \mathcal{O}_{1: T}, \psi\right)}\left[\nabla_{\psi} r_{\psi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]\end{aligned} $$p\left(\mathbf{s}_{t}, \mathbf{a}_{t} | \mathcal{O}_{1: T}, \psi\right)$$ can be further decomposed to $p\left(\mathbf{a}_{t} | \mathbf{s}_{t}, \mathcal{O}_{1: T}, \psi\right) p\left(\mathbf{s}_{t} | \mathcal{O}_{1: T}, \psi\right) \propto \beta(\mathbf{s}_t, \mathbf{a}_t) \alpha(\mathbf{s}_t)$ So we have the Max Entropy IRL algorithm: 1. Given $$\psi$$, compute backward message $$\beta(\mathbf{s}_t, \mathbf{a}_t)$$ and forward message $$\alpha(\mathbf{s}_t)$$ 2. Compute $$\mu_t(\mathbf{s}_t, \mathbf{a}_t) \propto \beta(\mathbf{s}_t, \mathbf{a}_t) \alpha(\mathbf{s}_t)$$ 3. 
Evaluate $$\displaystyle \nabla_{\psi} \mathcal{L}=\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\psi} r_{\psi}\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)-\sum_{t=1}^{T} \iint \mu_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \nabla_{\psi} r_{\psi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) d \mathbf{s}_{t} d \mathbf{a}_{t}$$ 4. $$\psi \leftarrow \psi + \eta \nabla_\psi \mathcal{L}$$ This algorithm is called MaxEnt because it optimizes $$\max_\psi \mathcal{H}(\pi^{r_\psi})$$ s.t. $$\mathbb{E}_{\pi^{r_\psi}}[\mathbf{f}] = \mathbb{E}_{\pi^{*}}[\mathbf{f}]$$ ## Unknown Dynamics When the dynamics are unknown, learn the policy $$p(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}_{1:T}, \psi)$$ using any MaxEnt RL algorithm, then run this policy to sample $$\{\tau_j\}$$. $$\nabla_\psi \mathcal{L}$$ becomes $\nabla_{\psi} \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\psi} r_{\psi}\left(\tau_{i}\right)-\frac{1}{M} \sum_{j=1}^{M} \nabla_{\psi} r_{\psi}\left(\tau_{j}\right)$ The first sum uses expert samples while the second uses policy samples. Instead of optimizing the policy to convergence after every reward update, use importance sampling to correct for the suboptimal policy samples and make the algorithm more efficient: $\nabla_{\psi} \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\psi} r_{\psi}\left(\tau_{i}\right)-\frac{1}{\sum_{j} w_{j}} \sum_{j=1}^{M} w_{j} \nabla_{\psi} r_{\psi}\left(\tau_{j}\right) \\w_{j}=\frac{p(\tau) \exp \left(r_{\psi}\left(\tau_{j}\right)\right)}{\pi\left(\tau_{j}\right)} = \frac{\exp(\sum_t r_\psi(\mathbf{s}_t, \mathbf{a}_t))}{\prod_t \pi(\mathbf{a}_t | \mathbf{s}_t)}$ ## Inverse RL as GAN The best discriminator in a GAN is $$D^{\star}(\mathbf{x})=\dfrac{p^{\star}(\mathbf{x})}{p_{\theta}(\mathbf{x})+p^{\star}(\mathbf{x})}$$.
In IRL, the optimal policy approaches $$\pi_{\theta}(\tau) \propto p(\tau) \exp \left(r_{\psi}(\tau)\right)$$; plug it in to get \begin{aligned}D_{\psi}(\tau) &= \frac{p(\tau) \frac{1}{Z} \exp (r(\tau))}{p_{\theta}(\tau)+p(\tau) \frac{1}{Z} \exp (r(\tau))} \\&= \frac{\frac{1}{Z} \exp (r(\tau))}{\prod_t \pi_\theta (\mathbf{a}_t | \mathbf{s}_t) + \frac{1}{Z} \exp (r(\tau))}\end{aligned} and optimize it w.r.t. $$\psi$$ $\psi \leftarrow \arg \max_{\psi} \mathbb{E}_{\tau \sim p^{*}}\left[\log D_{\psi}(\tau)\right]+\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\log \left(1-D_{\psi}(\tau)\right)\right]$ # Transfer Learning Definition: Using experience from one set of tasks for faster learning and better performance on a new task. Prior understanding of problem structure can help us solve complex tasks quickly. RL can store the prior knowledge in: • Q-function: which actions or states are good • Policy: which actions are potentially useful • Models: laws of physics that govern the world • Feature/Hidden states: good representation ## Forward Transfer • Fine-tuning: The most popular TL method in (supervised) deep learning. Challenges: 1. RL tasks are generally much less diverse • Features are less general • Policies & value functions become overly specialized 2. Optimal policies in fully observed MDPs are deterministic • Loss of exploration at convergence • Low-entropy policies adapt very slowly to new settings If we can manipulate the source domain and the target domain is difficult (sim-to-real transfer), randomize the source domain to add more diversity at training time and make the model more flexible. If we have some prior knowledge about the target domain, use domain adaptation (GAN) to make the network unable to distinguish observations from the two domains. ## Multi-Task Transfer More diversity = Better transfer. Transfer from multiple different tasks is closer to what people do.
• Model-based RL: train a model on past tasks and use it to solve new tasks; or fine-tune the model. • Model distillation: Instead of learning a model, learn a multi-task policy that can simultaneously perform many tasks. Construct a joint MDP, train each task separately, then combine the policies. • Contextual policies: Policies are told what to do in the same environment. • Modular network: Architectures (neural network) with reusable components. # Exploration Exploitation: doing what you know will yield the highest reward. Exploration: doing things you haven't done before, in the hopes of getting even higher reward. Assume $$r(a_i) \sim p_{\theta_i}(r_i)$$ and define the regret as the cumulative difference from the optimal action over $$T$$ time steps: $$\text{Reg}(T) = T \mathbb{E}[r(a^\star)] - \sum_{t=1}^T r(a_t)$$ ## Multi-Armed Bandit Problem First let's discuss exploration in simple 1-step stateless RL problems. ### Optimistic Exploration Keep track of the average reward $$\hat{\mu}_a$$ for each action $$a$$ and pick the action by $$a = \argmax \hat{\mu}_a + C \sigma_a$$. The intuition behind this algorithm is to try each action until you are sure it is not great. One popular model is UCB (Upper Confidence Bound): $a=\argmax \hat{\mu}_{a}+\sqrt{\frac{2 \ln T}{N(a)}}$ which bounds the regret by $$\text{Reg}(T) = \mathcal{O}(\log T)$$. ### Posterior Sampling $$r(a_i) \sim p_{\theta_i}(r_i)$$ defines a POMDP with $$\mathbf{s} = [\theta_1, \cdots, \theta_n]$$; the belief state is $$\hat{p}(\theta_1, \cdots, \theta_n)$$. 1. Sample $$\theta_1, \cdots, \theta_n \sim \hat{p}(\theta_1, \cdots, \theta_n)$$ 2. Use the $$\theta_1, \cdots, \theta_n$$ model to take the optimal action 3. Update the model ### Information Gain Bayesian experimental design: Learn some latent variable $$z$$ and use it to choose actions. Let $$\mathcal{H}(\hat{p}(z))$$ be the entropy of the $$z$$ estimate. Let $$\mathcal{H}(\hat{p}(z) | y)$$ be the entropy of the $$z$$ estimate after observation $$y$$.
The information gain: $$\text{IG}(z, y) = \mathbb{E}_y [\mathcal{H}(\hat{p}(z)) - \mathcal{H}(\hat{p}(z) | y)]$$. Choose the action that maximizes the IG. ## General Problems ### Count-Based Exploration Similar to optimistic exploration, we can add an exploration bonus in MDPs: $$r^+(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a}) + \mathcal{B}(N(\mathbf{s}))$$, where $$\mathcal{B}$$ decreases with $$N(\mathbf{s})$$. In practice, exact counts of states are hard to obtain. Instead fit a model $$p_\theta(\mathbf{s})$$ to estimate the density of $$\mathbf{s}$$, and use it to get a "pseudo-count". With $p_{\theta}\left(\mathbf{s}_{i}\right)=\frac{\hat{N}\left(\mathbf{s}_{i}\right)}{\hat{n}} \qquad p_{\theta^{\prime}}\left(\mathbf{s}_{i}\right)=\frac{\hat{N}\left(\mathbf{s}_{i}\right)+1}{\hat{n}+1}$ we can get $\hat{N}(\mathbf{s}_{i})=\hat{n} p_{\theta}(\mathbf{s}_{i}) \qquad \hat{n}=\frac{1-p_{\theta^{\prime}}(\mathbf{s}_{i})} {p_{\theta^{\prime}}(\mathbf{s}_{i})-p_{\theta}(\mathbf{s}_{i})}$ ### Implicit Density Model A state is novel if it is easy to distinguish from all previously seen states by a classifier. So we can estimate the density by a classifier $$D$$: $p_{\theta}(\mathbf{s})=\frac{1-D_{\mathbf{s}}(\mathbf{s})}{D_{\mathbf{s}}(\mathbf{s})}$ Training one classifier per state is too expensive, so in practice we train one amortized model: a single network that takes the exemplar as input. ### Heuristic Estimation Given a target function $$f^\star(\mathbf{s}, \mathbf{a})$$ and buffer $$\mathcal{D} = \{ (\mathbf{s}_i, \mathbf{a}_i) \}$$, fit $$\hat{f}_\theta(\mathbf{s}, \mathbf{a})$$ and use $$\mathcal{E}(\mathbf{s}, \mathbf{a})=\left\|\hat{f}_{\theta}(\mathbf{s}, \mathbf{a})-f^{\star}(\mathbf{s}, \mathbf{a})\right\|^{2}$$ as the bonus.
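The error bonus above can be sketched with small linear maps standing in for $$\hat{f}_\theta$$ and a fixed random target $$f^\star$$ (a minimal sketch; the linear form, sizes, and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the error bonus: f* is a frozen random target map and
# f_hat a learned predictor (tiny linear maps here; real implementations use
# deep networks). States visited often are predicted well, so their bonus
# shrinks; novel states keep a large bonus.
W_star = rng.normal(size=(4, 8))          # frozen random target f*(s)
W_hat = np.zeros((4, 8))                  # predictor, trained on visited states

def bonus(s):
    return float(np.sum((W_hat @ s - W_star @ s) ** 2))

def train(s, lr=0.01, steps=200):
    global W_hat
    for _ in range(steps):
        err = W_hat @ s - W_star @ s      # gradient of 0.5 * ||err||^2 w.r.t. W_hat
        W_hat -= lr * np.outer(err, s)

s_seen, s_novel = rng.normal(size=8), rng.normal(size=8)
train(s_seen)
assert bonus(s_seen) < bonus(s_novel)     # the visited state gets a smaller bonus
```

Training drives the prediction error to zero only on (the span of) visited states, which is exactly why the residual error works as a novelty signal.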
One common choice for the target function is next-state prediction: $$f^\star(\mathbf{s}, \mathbf{a}) = \mathbf{s}'$$, or simpler, $$f^{\star}(\mathbf{s}, \mathbf{a})=f_{\phi}(\mathbf{s}, \mathbf{a})$$ where $$\phi$$ is a random parameter vector. ## RL Problems ### Posterior Sampling In the bandit problem, we sample from $$p_{\theta_i}(r_i)$$, which is a distribution over rewards. In RL, we can sample from the $$Q$$-function. 1. sample a $$Q$$-function from $$p(Q)$$ 2. act according to $$Q$$ for one episode 3. update $$p(Q)$$ We can represent a distribution over functions by bootstrap: 1. given dataset $$\mathcal{D}$$, resample with replacement $$N$$ times to get $$\mathcal{D}_1, \dots, \mathcal{D}_N$$ 2. train each model $$f_{\theta_i}$$ on $$\mathcal{D}_i$$ 3. to sample from $$p(\theta)$$, sample $$i \in [1, \dots, N]$$ and use $$f_{\theta_i}$$ ### Information Gain Choices for $$\text{IG}(z,y|a)$$: • reward $$r(\mathbf{s}, \mathbf{a})$$: not very useful if the reward is sparse • state density $$p(\mathbf{s})$$: strange but somewhat makes sense • dynamics $$p(\mathbf{s}'|\mathbf{s}, \mathbf{a})$$: good for learning the MDP, but still heuristic IG is generally intractable to use exactly, but we can do approximations: • prediction gain: $$\log p_{\theta^{\prime}}(\mathbf{s})-\log p_{\theta}(\mathbf{s})$$. If the density changes a lot, the state is novel. • variational inference: IG is equivalent to $$D_{\mathrm{KL}}(p(z | y) \| p(z))$$. To learn about the dynamics $$p_{\theta}\left(s_{t+1} | s_{t}, a_{t}\right)$$, let $$z=\theta$$, $$y = (s_{t+1} | s_{t}, a_{t})$$. Then use variational inference to estimate $$q(\theta | \phi) \approx p(\theta | h)$$ ## Imitation vs. RL Imitation learning: • Requires demonstrations • Distributional shift • Simple, stable supervised learning • Only as good as the demo RL: • Requires a reward function • Must address exploration • Potentially non-convergent • Can become arbitrarily good Can we combine the best of both if we have demonstrations and rewards?
IRL already addresses distributional shift via RL, but it doesn't use a known reward function. The simplest way is pretrain & finetune: 1. collect demonstration data $$(\mathbf{s}_i, \mathbf{a}_i)$$ 2. initialize $$\pi_\theta$$ as $$\max_\theta \sum_i \log \pi_\theta(\mathbf{a}_i | \mathbf{s}_i)$$ 3. run $$\pi_\theta$$ to collect experience 4. improve $$\pi_\theta$$ with any RL algorithm The problem lies in steps 3 and 4, where the policy can be very bad due to distribution shift, and the first batch of bad data can destroy the initialization. The solution is to use off-policy RL and treat demonstrations as off-policy samples. ### Learning With Demonstrations Since policy gradient is on-policy, we need to use demonstrations together with importance sampling: $\nabla_{\theta} J(\theta)=\sum_{\tau \in \mathcal{D}}\left[\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\left(\prod_{t^{\prime}=1}^{t} \frac{\pi_{\theta}\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)}{q\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)\right)\right]$ Sample distribution choices: 1. use supervised behavior cloning to approximate $$\pi_{\text{demo}}$$ 2. assume a Dirac delta $$\pi_{\text{demo}}(\tau) = \delta(\tau \in D) / N$$ To fuse multiple distributions (demo & policy samples), use $$q(x) = \sum_i q_i(x) / M$$. $$Q$$-learning is already off-policy, so there is no need for importance sampling; a simple solution is to drop the demonstrations into the replay buffer.
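The self-normalized importance weights used above can be sketched directly (a toy sketch; the rewards and action probabilities are made-up numbers, and the computation is done in log space for numerical stability):

```python
import numpy as np

# Sketch of self-normalized importance weights for demonstration trajectories.
# Each row is one trajectory of length T = 3; in practice rewards come from
# the demos and probabilities from the current policy pi_theta.
rewards = np.array([[1.0, 0.5, 0.2],
                    [0.1, 0.1, 0.1]])     # r(s_t, a_t) along each trajectory
pi_probs = np.array([[0.9, 0.8, 0.7],
                     [0.2, 0.3, 0.1]])    # pi_theta(a_t | s_t) along each trajectory

# w_j = exp(sum_t r) / prod_t pi, computed in log space
log_w = rewards.sum(axis=1) - np.log(pi_probs).sum(axis=1)
w = np.exp(log_w - log_w.max())           # subtract the max before exponentiating
w /= w.sum()                              # self-normalization: the 1 / sum_j w_j factor

# The demo that is unlikely under the current policy gets a larger weight.
assert w[1] > w[0]
```

Note how the second trajectory, poorly covered by $$\pi_\theta$$, is upweighted: importance sampling compensates for the mismatch between the demo distribution and the current policy.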
### Hybrid Goal • Imitation objective: $$\sum_{(\mathbf{s}, \mathbf{a}) \in \mathcal{D}_{\text {demo}}} \log \pi_{\theta}(\mathbf{a} | \mathbf{s})$$ • RL objective: $$\mathbb{E}_{\pi_{\theta}}[r(\mathbf{s}, \mathbf{a})]$$ • Hybrid objective: $$\mathbb{E}_{\pi_{\theta}}[r(\mathbf{s}, \mathbf{a})] + \lambda \sum_{(\mathbf{s}, \mathbf{a}) \in \mathcal{D}_{\text {demo}}} \log \pi_{\theta}(\mathbf{a} | \mathbf{s})$$ # Meta-RL Regular RL: learn policy for single task $$\mathcal{M}$$ \begin{aligned}\theta^{\star} &=\argmax_{\theta} \mathbb{E}_{\pi_{\theta}(\tau)}[R(\tau)] \\&=f_{\mathrm{RL}}(\mathcal{M})\end{aligned} Meta-RL: learn adaptation rule $\theta^{\star}=\arg \max _{\theta} \sum_{i=1}^{n} \mathbb{E}_{\pi_{\phi_{i}}(\tau)}[R(\tau)] \\\phi_{i}=f_{\theta}\left(\mathcal{M}_{i}\right)$ General Meta-RL Algorithm: 1. sample task $$i$$, collect data $$\mathcal{D}_i$$ 2. adapt policy by computing $$\phi_i = f(\theta, \mathcal{D}_i)$$ 3. collect data $$\mathcal{D}'_i$$ with adapted policy $$\pi_{\phi_i}$$ 4. update $$\theta$$ according to $$\mathcal{L}(\mathcal{D}'_i, \phi_i)$$ Specific algorithms depend on the choice of $$f$$ and $$\mathcal{L}$$ ## Algorithms ### Recurrence Implement the policy as a recurrent network so that it can remember old data. 1. initialize hidden state $$\mathbf{h}_0 = 0$$ for task $$i$$ 2. sample transition $$\mathcal{D}_{i}=\mathcal{D}_{i} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, \mathbf{s}_{t+1}, r_{t}\right)\right\}$$ from $$\pi_{\mathbf{h}_t}$$ 3. update policy hidden state $$\mathbf{h}_{t+1} = f_\theta(\mathbf{h}_t,\mathbf{s}_{t}, \mathbf{a}_{t}, \mathbf{s}_{t+1}, r_{t})$$ 4. update policy parameters $$\theta \leftarrow \theta-\nabla_{\theta} \sum_{i} \mathcal{L}_{i}\left(\mathcal{D}_{i}, \pi_{\mathbf{h}}\right)$$ ### Optimization 1. sample $$k$$ episodes $$\mathcal{D}_{i}=\left\{\left(\mathbf{s}, \mathbf{a}, \mathbf{s}', r\right)_{1:k}\right\}$$ from $$\pi_{\theta}$$ 2. 
compute adapted parameters $$\theta'_{i}=\theta-\alpha \nabla_{\theta} \mathcal{L}_{i}\left(\pi_{\theta}, \mathcal{D}_{i}\right)$$ 3. sample $$k$$ episodes $$\mathcal{D}'_{i}=\left\{\left(\mathbf{s}, \mathbf{a}, \mathbf{s}', r\right)_{1:k}\right\}$$ from $$\pi_{\theta'}$$ 4. $$\theta \leftarrow \theta-\nabla_{\theta} \sum_{i} \mathcal{L}_{i}\left(\mathcal{D}'_{i}, \pi_{\theta'_i}\right)$$ Step 4 requires second-order derivatives. ### Latent Model Use adaptation data $$\mathbf{c}$$ to train a latent variable $$\mathbf{z}$$ representing the task belief $$p(\mathbf{z} | \mathbf{c})$$. The training objective: $\mathbb{E}_{\mathcal{T}}\left[\mathbb{E}_{\mathbf{z} \sim q_{\phi}\left(\mathbf{z} | \mathbf{c}^{\mathcal{T}}\right)}\left[R(\mathcal{T}, \mathbf{z})+\beta D_{\mathrm{KL}}(q_{\phi}(\mathbf{z} | \mathbf{c}^{\mathcal{T}})|| p(\mathbf{z}))\right]\right]$ where $$R$$ is the "likelihood" term (Bellman error) and the KL-divergence is the "regularization" term. # Information-Theoretic Exploration ==TODO== # Challenges in Deep RL ==TODO== # Proof ## Proof 1 If the action prior is not uniform, then $V(\mathbf{s}_t) = \log \int \exp \left( Q(\mathbf{s}_t, \mathbf{a}_t) + \log p(\mathbf{a}_t | \mathbf{s}_t) \right) d\mathbf{a}_t \\Q(\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \log \mathbb{E}\left[ \exp \left( V(\mathbf{s}_{t+1}) \right) \right]$ Let $$\tilde{Q}(\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \log p(\mathbf{a}_t | \mathbf{s}_t) + \log \mathbb{E}\left[ \exp \left( V(\mathbf{s}_{t+1}) \right) \right]$$; we can fold the action prior into the reward, and $$V$$ becomes $V(\mathbf{s}_t) = \log \int \exp \left( \tilde{Q}(\mathbf{s}_t, \mathbf{a}_t) \right) d\mathbf{a}_t$ ## Proof 2 The ELBO is $\log p(\mathcal{O}_{1:T}) \ge \mathbb{E}_{(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}) \sim q} \left[ \log p(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}, \mathcal{O}_{1:T}) - \log q(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}) \right]$ Plug in $$q(\mathbf{s}_{1:T}, \mathbf{a}_{1:T})$$ and $$p(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}, \mathcal{O}_{1:T}) = p(\mathbf{s}_{1}) \prod_t p(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}) p(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t})$$ to get \begin{aligned}\log p(\mathcal{O}_{1:T}) &\ge \mathbb{E}_{(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}) \sim q} \left[ \sum_{t=1}^T \log p(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}) - \sum_{t=1}^T \log q(\mathbf{a}_{t} | \mathbf{s}_{t}) \right] \\&= \mathbb{E}_{(\mathbf{s}_{1:T}, \mathbf{a}_{1:T}) \sim q} \left[ \sum_{t=1}^T \big( r(\mathbf{s}_{t}, \mathbf{a}_{t}) - \log q(\mathbf{a}_{t} | \mathbf{s}_{t}) \big) \right] \\&= \sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_{t}, \mathbf{a}_{t}) \sim q} \left[ r(\mathbf{s}_{t}, \mathbf{a}_{t}) + \mathcal{H}(q(\mathbf{a}_{t} | \mathbf{s}_{t})) \right]\end{aligned} Optimizing the base case $$t = T$$: $\mathbb{E}_{(\mathbf{s}_T, \mathbf{a}_T) \sim q} \left[ r(\mathbf{s}_T, \mathbf{a}_T) - \log q(\mathbf{a}_T | \mathbf{s}_T) \right] = \\\mathbb{E}_{\mathbf{s}_T \sim q(\mathbf{s}_T)} \left[ - D_{\mathrm{KL}} \left( q(\mathbf{a}_T | \mathbf{s}_T) \| \frac{1}{\exp(V(\mathbf{s}_T))} \exp(r(\mathbf{s}_T, \mathbf{a}_T)) \right) + V(\mathbf{s}_T) \right]$ where $$V(\mathbf{s}_t) = \log \int \exp(Q(\mathbf{s}_t, \mathbf{a}_t)) d \mathbf{a}_t$$ and $$\exp(V(\mathbf{s}_T))$$ is the
normalizing constant for $$\exp(r(\mathbf{s}_T, \mathbf{a}_T))$$. So the optimal policy is $q(\mathbf{a}_T | \mathbf{s}_T) = \exp(r(\mathbf{s}_T, \mathbf{a}_T) - V(\mathbf{s}_T))$ For a given time step $$t$$, $$q(\mathbf{a}_t | \mathbf{s}_t)$$ must maximize two terms: $\begin{equation}\mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim q(\mathbf{s}_t, \mathbf{a}_t)} \left[ r(\mathbf{s}_t, \mathbf{a}_t) - \log q(\mathbf{a}_t | \mathbf{s}_t) \right] + \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim q(\mathbf{s}_t, \mathbf{a}_t)} \left[ \mathbb{E}_{\mathbf{s}_{t+1} \sim p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)} \left[ V(\mathbf{s}_{t+1}) \right] \right]\label{dpobj}\end{equation}$ where the second term represents the contribution of $$q(\mathbf{a}_t | \mathbf{s}_t)$$ to the expectations of all subsequent time steps. This can be seen by plugging the optimal policy into the base case to leave only $$V(\mathbf{s}_T)$$.
Rewrite $$\eqref{dpobj}$$ as $\mathbb{E}_{\mathbf{s}_t \sim q(\mathbf{s}_t)} \left[ - D_{\mathrm{KL}} \left( q(\mathbf{a}_t | \mathbf{s}_t) \| \frac{1}{\exp(V(\mathbf{s}_t))} \exp(Q(\mathbf{s}_t, \mathbf{a}_t)) \right) + V(\mathbf{s}_t) \right]$ where we now define \begin{aligned}Q(\mathbf{s}_t, \mathbf{a}_t) &= r(\mathbf{s}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{s}_{t+1} \sim p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)} \left[ V(\mathbf{s}_{t+1}) \right] \\V(\mathbf{s}_t) &= \log \int \exp \left( Q(\mathbf{s}_t, \mathbf{a}_t) \right) d\mathbf{a}_t\end{aligned} and the optimal policy is $$q(\mathbf{a}_t | \mathbf{s}_t) = \exp(Q(\mathbf{s}_t, \mathbf{a}_t) - V(\mathbf{s}_t))$$ ]]> Deep Reinforcement Learning (Part 2) https://silencial.github.io/deep-reinforcement-learning-2/ 2020-02-06T00:00:00.000Z 2020-02-12T00:00:00.000Z Berkeley CS 285 Review My solution to the homework Deep Reinforcement Learning (Part 1) Deep Reinforcement Learning (Part 3) # Optimal Control and Planning If we know the dynamics $$p(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{u}_t)$$: • Games (e.g. Atari games, chess, Go) • Easily modeled systems (e.g. navigating a car) • Simulated environments (e.g. simulated robots, video games) Or we can learn the dynamics: • System identification - fit unknown parameters of a known model • Learning - fit a general-purpose model to observed transition data Knowing or learning the dynamics often makes learning easier. Previous methods (policy gradient, value-based, actor-critic) do not require $$p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$$ to learn, and are therefore called model-free RL.
## Open-Loop vs. Closed-Loop Planning For a deterministic open-loop system, the objective is: $\begin{equation}\DeclareMathOperator*{\argmin}{\arg\min}\DeclareMathOperator*{\argmax}{\arg\max}\mathbf{a}_{1}, \ldots, \mathbf{a}_{T}=\argmax _{\mathbf{a}_{1}, \ldots, \mathbf{a}_{T}} \sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \\\text {s.t.} \quad \mathbf{s}_{t+1}=f\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\label{dol}\end{equation}$ For a stochastic open-loop system, the objective is: $p_{\theta}\left(\mathbf{s}_{1}, \ldots, \mathbf{s}_{T} | \mathbf{a}_{1}, \ldots, \mathbf{a}_{T}\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\\mathbf{a}_{1}, \ldots, \mathbf{a}_{T}=\argmax_{\mathbf{a}_{1}, \ldots, \mathbf{a}_{T}} \mathbb{E}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) | \mathbf{a}_{1}, \ldots, \mathbf{a}_{T}\right]$ For a stochastic closed-loop system, the objective is: $p_{\theta}\left(\mathbf{s}_{1}, \ldots, \mathbf{s}_{T} | \mathbf{a}_{1}, \ldots, \mathbf{a}_{T}\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\\pi=\argmax _{\pi} \mathbb{E}_{\tau \sim p(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$ ## Stochastic Optimization Methods For open-loop planning, rewrite the objective $$\eqref{dol}$$ as: $\mathbf{A} = \argmax_\mathbf{A} J (\mathbf{A})$ where $$\mathbf{A} = (\mathbf{a}_1, \dots, \mathbf{a}_T)$$ and $$J$$ is the general objective. ### Random Shooting Method 1. pick $$\mathbf{A}_1, \dots, \mathbf{A}_N$$ from some distribution (e.g. uniform) 2. choose $$\mathbf{A}_i$$ with $$i = \argmax_i J(\mathbf{A}_i)$$ ### Cross-Entropy Method (CEM) 1. sample $$\mathbf{A}_1, \dots, \mathbf{A}_N$$ from some distribution $$p(\mathbf{A})$$ 2. evaluate $$J(\mathbf{A}_i)$$ and pick the elites $$\mathbf{A}_{i_1}, \dots, \mathbf{A}_{i_M}$$ with the highest value 3.
refit $$p(\mathbf{A})$$ to the elites CEM is very fast if parallelized and extremely simple, but it has a very harsh dimensionality limit and works only for open-loop planning. ## Monte Carlo Tree Search (MCTS) For the discrete case: 1. choose a leaf $$s_l$$ using TreePolicy($$s_1$$) 2. evaluate the leaf using DefaultPolicy($$s_l$$) 3. update all values in the tree between $$s_1$$ and $$s_l$$ Upper Confidence Bounds for Trees (UCT) TreePolicy($$s_t$$): if $$s_t$$ is not fully expanded, choose a new $$a_t$$, else choose the child with the best Score($$s_{t+1}$$), where $$\operatorname{Score}(s_t) = \frac{Q(s_t)}{N(s_t)} + 2 C \sqrt{\frac{2 \ln N(s_{t-1})}{N(s_t)}}$$ ## Trajectory Optimization There are two different ways to optimize $$\eqref{dol}$$: shooting method: optimize over actions only. ==Extremely sensitive to the initial action== $\min_{\mathbf{u}_1, \ldots, \mathbf{u}_T} c(\mathbf{x}_1, \mathbf{u}_1) + c(f(\mathbf{x}_1, \mathbf{u}_1), \mathbf{u}_2) + \cdots + c(f(f(\ldots) \ldots), \mathbf{u}_T)$ collocation method: optimize over actions and states, with constraints $\min_{\mathbf{u}_1, \ldots, \mathbf{u}_T, \mathbf{x}_1, \ldots, \mathbf{x}_T} \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t) \quad \text { s.t.
}\mathbf{x}_{t} = f\left(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}\right)$ ## LQR Now we focus on the shooting method but assume $$f$$ is linear and $$c$$ is quadratic: $f(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t \quad c(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{C}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{c}_t$ Then we can solve it by backward recursion and forward recursion (Proof). Backward recursion: for $$t=T$$ to $$1$$ \begin{aligned}&\mathbf{Q}_t = \mathbf{C}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{F}_t \\&\mathbf{q}_t = \mathbf{c}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{f}_t + \mathbf{F}_t^T \mathbf{v}_{t+1} \\&Q(\mathbf{x}_t, \mathbf{u}_t) = \text{const} + \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{Q}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{q}_t \\&\mathbf{u}_t \leftarrow \argmin_{\mathbf{u}_t} Q(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t \\&\mathbf{K}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t} \\&\mathbf{k}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{q}_{\mathbf{u}_t} \\&\mathbf{V}_t = \mathbf{Q}_{\mathbf{x}_t, \mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{K}_t + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{K}_t \\&\mathbf{v}_t = \mathbf{q}_{\mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{k}_t + \mathbf{K}_t^T \mathbf{q}_{\mathbf{u}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{k}_t \\&V(\mathbf{x}_t) = \text{const} + \frac{1}{2} \mathbf{x}_t^T \mathbf{V}_t \mathbf{x}_t + \mathbf{x}_t^T \mathbf{v}_t\end{aligned} Forward recursion: for $$t=1$$ to $$T$$ $\mathbf{u}_{t} = \mathbf{K}_{t} \mathbf{x}_{t} + \mathbf{k}_{t} \\\mathbf{x}_{t+1} = f(\mathbf{x}_{t}, \mathbf{u}_{t})$ ### Stochastic Dynamics For stochastic dynamics, we can use Gaussians to model the transitions: $p(\mathbf{x}_{t+1} | \mathbf{x}_{t}, \mathbf{u}_{t}) = \mathcal{N}\left( \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t, \Sigma_t \right)$ and the algorithm stays the same; $$\Sigma_t$$ can be ignored due to the symmetry of Gaussians.
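The backward and forward recursions above can be sketched in NumPy (a minimal sketch assuming time-invariant $$\mathbf{F}, \mathbf{f}, \mathbf{C}, \mathbf{c}$$ for brevity; the double-integrator example is an illustrative assumption):

```python
import numpy as np

def lqr(F, f, C, c, T, x0):
    """Finite-horizon LQR: backward recursion for K_t, k_t, then a forward rollout.

    Dynamics x' = F @ [x; u] + f, cost 0.5 [x; u]^T C [x; u] + [x; u]^T c,
    with F, f, C, c shared across time steps for brevity.
    """
    n = F.shape[0]                          # state dimension
    V, v = np.zeros((n, n)), np.zeros(n)    # value function is zero after step T
    Ks, ks = [], []
    for _ in range(T):                      # backward pass: t = T, ..., 1
        Q = C + F.T @ V @ F
        q = c + F.T @ (V @ f + v)
        Qxx, Qxu = Q[:n, :n], Q[:n, n:]
        Qux, Quu = Q[n:, :n], Q[n:, n:]
        qx, qu = q[:n], q[n:]
        K = -np.linalg.solve(Quu, Qux)      # K_t = -Quu^{-1} Qux
        k = -np.linalg.solve(Quu, qu)       # k_t = -Quu^{-1} qu
        V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
        v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
        Ks.append(K); ks.append(k)
    Ks.reverse(); ks.reverse()

    xs, x = [x0], x0                        # forward pass: t = 1, ..., T
    for K, k in zip(Ks, ks):
        u = K @ x + k
        x = F @ np.concatenate([x, u]) + f
        xs.append(x)
    return np.array(xs)

# Double integrator (position, velocity; force input), dt = 0.1.
dt = 0.1
F = np.array([[1.0, dt, 0.0],
              [0.0, 1.0, dt]])
C = np.diag([1.0, 1.0, 0.1])                # penalize state, lightly penalize control
traj = lqr(F, np.zeros(2), C, np.zeros(3), T=100, x0=np.array([1.0, 0.0]))
assert np.linalg.norm(traj[-1]) < 0.1 * np.linalg.norm(traj[0])
```

The rollout drives the double integrator toward the origin, matching the intuition that the backward pass yields a time-varying linear feedback law $$\mathbf{u}_t = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t$$.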
## iLQR For the nonlinear case, approximate $$f$$ and $$c$$ by first- and second-order Taylor expansions, respectively: $f\left(\mathbf{x}_{t}, \mathbf{u}_{t}\right)-f\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right) \approx \nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}} f\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)\begin{bmatrix}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t} \\\mathbf{u}_{t}-\hat{\mathbf{u}}_{t}\end{bmatrix}\\c\left(\mathbf{x}_{t}, \mathbf{u}_{t}\right)-c\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right) \approx \nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}} c\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)\begin{bmatrix}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t} \\\mathbf{u}_{t}-\hat{\mathbf{u}}_{t}\end{bmatrix}+\frac{1}{2}\begin{bmatrix}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t} \\\mathbf{u}_{t}-\hat{\mathbf{u}}_{t}\end{bmatrix}^{T} \nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}}^{2} c\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)\begin{bmatrix}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t} \\\mathbf{u}_{t}-\hat{\mathbf{u}}_{t}\end{bmatrix}$ Let $$\delta \mathbf{x}_{t} = \mathbf{x}_{t} - \hat{\mathbf{x}}_{t}$$, $$\delta \mathbf{u}_{t} = \mathbf{u}_{t} - \hat{\mathbf{u}}_{t}$$. Rearrange to get: $\bar{f}\left(\delta \mathbf{x}_{t}, \delta \mathbf{u}_{t}\right)=\mathbf{F}_{t}\begin{bmatrix}\delta \mathbf{x}_{t} \\\delta \mathbf{u}_{t}\end{bmatrix} \\\bar{c}\left(\delta \mathbf{x}_{t}, \delta \mathbf{u}_{t}\right)=\frac{1}{2}\begin{bmatrix}\delta \mathbf{x}_{t} \\\delta \mathbf{u}_{t}\end{bmatrix}^{T} \mathbf{C}_{t}\begin{bmatrix}\delta \mathbf{x}_{t} \\\delta \mathbf{u}_{t}\end{bmatrix}+\begin{bmatrix}\delta \mathbf{x}_{t} \\\delta \mathbf{u}_{t}\end{bmatrix}^{T} \mathbf{c}_{t}$ and run LQR with $$\bar{f}, \bar{c}, \delta \mathbf{x}_t, \delta \mathbf{u}_t$$. iLQR may overshoot; a line search can be applied during the forward pass (search for the lowest cost) to correct this. The final algorithm becomes 1.
$$\mathbf{F}_{t}=\nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}} f\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)$$ 2. $$\mathbf{c}_{t}=\nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}} c\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)$$ 3. $$\mathbf{C}_{t}=\nabla_{\mathbf{x}_{t}, \mathbf{u}_{t}}^{2} c\left(\hat{\mathbf{x}}_{t}, \hat{\mathbf{u}}_{t}\right)$$ 4. Run the LQR backward pass on $$\delta \mathbf{x}_t, \delta \mathbf{u}_t$$ 5. Run the forward pass with $$\mathbf{u}_{t} = \mathbf{K}_{t} (\mathbf{x}_{t} - \hat{\mathbf{x}}_{t}) + \alpha \mathbf{k}_{t} + \hat{\mathbf{u}}_t$$ ## DDP Compare to Newton's method for computing $$\min_\mathbf{x} g(\mathbf{x})$$: $\mathbf{g}=\nabla_{\mathbf{x}} g(\hat{\mathbf{x}}) \\\mathbf{H}=\nabla_{\mathbf{x}}^{2} g(\hat{\mathbf{x}}) \\\hat{\mathbf{x}} \leftarrow \arg \min_{\mathbf{x}} \frac{1}{2}(\mathbf{x}-\hat{\mathbf{x}})^{T} \mathbf{H}(\mathbf{x}-\hat{\mathbf{x}})+\mathbf{g}^{T}(\mathbf{x}-\hat{\mathbf{x}})$ If we also use a second-order dynamics approximation, the method is called differential dynamic programming (DDP). # Model-Based RL ## Basic Model ### Naive Model 1. run base policy $$\pi_0(\mathbf{a}_t | \mathbf{s}_t)$$ to collect $$\mathcal{D} = \{(\mathbf{s}, \mathbf{a}, \mathbf{s}')_i\}$$ 2. learn dynamics model $$f(\mathbf{s}, \mathbf{a})$$ to minimize $$\sum_i \|f(\mathbf{s}_i, \mathbf{a}_i) - \mathbf{s}'_i\|^2$$ 3. plan through $$f(\mathbf{s}, \mathbf{a})$$ to choose actions This is how system identification works in classical robotics, and it is particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics and fit just a few parameters. ### Improvements Similar to imitation learning, the naive model can suffer from the distribution mismatch problem, which can be solved by running DAgger on the model $$f$$ instead of the policy $$\pi$$.
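The naive loop above can be sketched end-to-end on a toy linear system (a minimal sketch; the environment, cost, and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch of the naive model-based loop: collect transitions with a
# random base policy, fit f(s, a) by least squares, then plan with random
# shooting. The true dynamics are a made-up scalar linear system.
def true_step(s, a):
    return 0.9 * s + 0.5 * a

# 1. run a base (random) policy to collect D = {(s, a, s')}
S, A = rng.normal(size=500), rng.normal(size=500)
S_next = true_step(S, A)

# 2. learn f(s, a) = w1*s + w2*a minimizing sum_i ||f(s_i, a_i) - s'_i||^2
X = np.stack([S, A], axis=1)
w, *_ = np.linalg.lstsq(X, S_next, rcond=None)

# 3. plan through f: random shooting over action sequences, cost = sum s^2
def plan(s0, horizon=5, n_samples=256):
    seqs = rng.uniform(-1, 1, size=(n_samples, horizon))
    costs = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = s0
        for a in seq:
            s = w[0] * s + w[1] * a       # roll out the learned model
            costs[i] += s ** 2
    return seqs[np.argmin(costs)]

best = plan(s0=2.0)
assert np.allclose(w, [0.9, 0.5])         # exact fit: noiseless linear data
```

Under MPC (the improvement described next), only `best[0]` would be executed before replanning from the newly observed state.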
Instead of doing open-loop planning in step 3, which can cause cumulative error, we replan at every time step. Combining these two strategies: 1. run base policy $$\pi_0(\mathbf{a}_t | \mathbf{s}_t)$$ to collect $$\mathcal{D} = \{(\mathbf{s}, \mathbf{a}, \mathbf{s}')_i\}$$ 2. learn dynamics model $$f(\mathbf{s}, \mathbf{a})$$ to minimize $$\sum_i \|f(\mathbf{s}_i, \mathbf{a}_i) - \mathbf{s}'_i\|^2$$ 3. plan through $$f(\mathbf{s}, \mathbf{a})$$ to choose actions 4. execute the first planned action, observe the resulting state $$\mathbf{s}'$$ (MPC) 5. append $$(\mathbf{s}, \mathbf{a}, \mathbf{s}')$$ to $$\mathcal{D}$$ ## Uncertainty-Aware Models Basic model-based RL suffers from overfitting and can easily get stuck in a local minimum, since in step 3 we only take actions for which we think the expected reward is high. To solve this problem we must consider uncertainty in the model: • Use the output entropy of $$p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$$. This is not enough because it only captures noise in the data, not uncertainty about the model itself. • Estimate the model uncertainty $$p(\theta | \mathcal{D})$$ and combine it with the statistical uncertainty to get $$\int p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}, \theta\right) p(\theta | \mathcal{D}) d \theta$$ ### Models Bayesian neural network (BNN): In a BNN, nodes are connected by distributions instead of point-estimate weights. $p(\theta | \mathcal{D}) = \prod_i p(\theta_i | \mathcal{D}) \\p(\theta_i | \mathcal{D}) = \mathcal{N}(\mu_i, \sigma_i)$ Bootstrap ensembles: Train multiple models and see if they agree.
$\int p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}, \theta\right) p(\theta | \mathcal{D}) d \theta \approx \frac{1}{N} \sum_{i} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}, \theta_{i}\right)$ Need to generate independent datasets to get independent models. However, resampling with replacement is usually unnecessary, because SGD and random initialization usually make the models sufficiently independent. ## Latent Space Models For the standard (fully observed) model, the goal is $\max _{ \phi } \frac { 1 } { N } \sum_ { i = 1 } ^ { N } \sum _{ t = 1 } ^ { T } \log p_ { \phi } \left( \mathbf { s } _{ t + 1 , i } | \mathbf { s }_ { t , i } , \mathbf { a } _ { t , i } \right)$ For complex observations (high dimensionality, redundancy, partial observability), we have to separately learn • $$p(\mathbf{o}_t | \mathbf{s}_t)$$: high-dimensional but not dynamic • $$p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$$: low-dimensional but dynamic For the latent space model, the goal is $\max _{ \phi } \frac { 1 } { N } \sum_ { i = 1 } ^ { N } \sum _{ t = 1 } ^ { T } \mathbb{E}_{\left( \mathbf { s } _{ t } , \mathbf { s }_ { t + 1 } \right) \sim p \left( \mathbf { s } _{ t } , \mathbf { s }_ { t + 1 } | \mathbf { o } _{ 1 : T } , \mathbf { a }_ { 1 : T } \right)} \left[ \log p _ { \phi } \left( \mathbf { s } _ { t + 1 , i } | \mathbf { s } _ { t , i } , \mathbf { a } _ { t , i } \right) + \log p _ { \phi } \left( \mathbf { o } _ { t , i } | \mathbf { s } _ { t , i } \right) \right]$ We have many choices to approximate $$p \left( \mathbf { s } _{ t } , \mathbf { s }_ { t + 1 } | \mathbf { o } _{ 1 : T } , \mathbf { a }_ { 1 : T } \right)$$: • $$q_\psi ( \mathbf { s }_ { t } | \mathbf { o } _{ 1 : T } , \mathbf { a }_ { 1 : T } )$$: encoder • $$q_\psi ( \mathbf { s }_ { t }, \mathbf { s } _{ t+1} | \mathbf { o }_ { 1 : T } , \mathbf { a } _ { 1 : T } )$$: full smoothing posterior.
Most accurate but complicated • $$q_\psi ( \mathbf { s }_ { t } | \mathbf { o } _ { t } )$$: single-step encoder. Simplest but least accurate Assuming $$q(\mathbf{s}_t | \mathbf{o}_t)$$ is deterministic, we get the single-step deterministic encoder: $$q_\psi ( \mathbf { s }_ { t } | \mathbf { o } _{ t } ) = \delta(\mathbf{s}_t = g_\psi(\mathbf{o}_t)) \Rightarrow \mathbf { s } _{ t } = g_\psi(\mathbf { o } _{ t })$$. Adding the reward model, the latent space objective can be written as: $\max_ { \phi , \psi } \frac { 1 } { N } \sum _{ i = 1 } ^ { N } \sum_ { t = 1 } ^ { T } \log p _{ \phi } \left( g_ { \psi } \left( \mathbf { o } _{ t + 1 , i } \right) | g_ { \psi } \left( \mathbf { o } _{ t , i } \right) , \mathbf { a }_ { t , i } \right) + \log p _{ \phi } \left( \mathbf { o }_ { t , i } | g _{ \psi } \left( \mathbf { o }_ { t , i } \right) \right) + \log p _{ \phi } \left( r_ { t , i } | g _{ \psi } \left( \mathbf { o }_ { t , i } \right) \right)$ Everything here is differentiable, so the model can be trained with backprop. 1. run base policy $$\pi_0(\mathbf{a}_t | \mathbf{o}_t)$$ to collect $$\mathcal { D } = \left\{ \left( \mathbf { o } , \mathbf { a } , \mathbf { o } ^ { \prime } \right)_ { i } \right\}$$ 2. learn $$p _{ \phi } \left( \mathbf { s }_ { t + 1} | \mathbf { s } _{ t} , \mathbf { a }_ { t } \right)$$, $$p _{ \phi } \left( \mathbf { o }_ { t } | \mathbf { s } _{ t } \right)$$, $$p_ { \phi } \left( r _{ t } | \mathbf { s }_ { t } \right)$$, $$g _{ \psi } \left( \mathbf { o }_ { t } \right)$$ 3. plan through the model to choose actions 4. execute the first planned action, observe resulting $$\mathbf{o}'$$ (MPC) 5. append $$\left( \mathbf { o } , \mathbf { a } , \mathbf { o } ^ { \prime } \right)$$ to $$\mathcal{D}$$ # Model-Based Policy Learning ## Backprop Into Policy 1. run base policy $$\pi_0(\mathbf{a}_t | \mathbf{s}_t)$$ to collect $$\mathcal { D } = \left\{ \left( \mathbf { s } , \mathbf { a } , \mathbf { s } ^ { \prime } \right)_ { i } \right\}$$ 2.
learn dynamics model $$f(\mathbf{s}, \mathbf{a})$$ to minimize $$\sum _{ i } \left\| f \left( \mathbf { s }_ { i } , \mathbf { a } _{ i } \right) - \mathbf { s }_ { i } ^ { \prime } \right\| ^ { 2 }$$ 3. backprop through $$f(\mathbf{s}, \mathbf{a})$$ into the policy to optimize $$\pi_\theta(\mathbf{a}_t | \mathbf{s}_t)$$ 4. run $$\pi_\theta(\mathbf{a}_t | \mathbf{s}_t)$$, appending $$\left( \mathbf { s } , \mathbf { a } , \mathbf { s } ^ { \prime } \right)$$ to $$\mathcal{D}$$ Problems: • The same parameter-sensitivity problems as shooting methods. The policy parameters couple all the time steps, so dynamic programming is unavailable. • The same problems as training long RNNs with BPTT: vanishing and exploding gradients. • Unlike an LSTM, we can't choose simple, well-behaved dynamics. Solutions: • Use derivative-free (model-free) RL algorithms, with the model used to generate synthetic samples. Works well in practice; essentially "model-based acceleration" for model-free RL. • Use simpler policies than neural nets. • LQR with learned models (LQR-FLM) • Train local policies to solve simple tasks • Combine them into global policies via supervised learning ## Model-Free RL With Model $\nabla _ { \theta } J ( \theta ) = \sum _ { t = 1 } ^ { T } \frac { d r _ { t } } { d \mathbf { s } _ { t } } \prod _ { t ^ { \prime } = 2 } ^ { t } \frac { d \mathbf { s } _ { t ^ { \prime } } } { d \mathbf { a } _ { t ^ { \prime } - 1 } } \frac { d \mathbf { a } _ { t ^ { \prime } - 1 } } { d \mathbf { s } _ { t ^ { \prime } - 1 } }$ Policy gradient might be more stable because it does not require multiplying many Jacobians. ### Dyna Online $$Q$$-learning algorithm that performs model-free RL with a model: 1. given state $$\mathbf{s}$$, pick action $$\mathbf{a}$$ using exploration policy 2. observe $$\mathbf{s}'$$ and $$r$$, to get transition $$(\mathbf{s}, \mathbf{a}, \mathbf{s}', r)$$ 3.
update model $$\hat{p}(\mathbf{s}' | \mathbf{s}, \mathbf{a})$$ and $$\hat{r}(\mathbf{s}, \mathbf{a})$$ using $$(\mathbf{s}, \mathbf{a}, \mathbf{s}')$$ 4. $$Q$$-update: $$Q(\mathbf{s},\mathbf{a}) \leftarrow Q(\mathbf{s},\mathbf{a}) + \alpha \mathbb{E}_{\mathbf{s}', r} [r + \max_{\mathbf{a}'} Q(\mathbf{s}',\mathbf{a}') - Q(\mathbf{s},\mathbf{a})]$$ 5. sample $$(\mathbf{s},\mathbf{a}) \sim \mathcal{B}$$ from the buffer of past states and actions 6. $$Q$$-update: $$Q(\mathbf{s},\mathbf{a}) \leftarrow Q(\mathbf{s}, \mathbf{a}) + \alpha \mathbb{E}_{\mathbf{s}' ,r} [r + \max_{\mathbf{a}'} Q(\mathbf{s}', \mathbf{a}') - Q(\mathbf{s},\mathbf{a})]$$ The model is used in the $$Q$$-update to compute the expectation, instead of using samples as in standard $$Q$$-learning. General Dyna-style model-based RL generates samples from the model for RL to learn from (step 5 below): 1. collect some data, consisting of transitions $$(\mathbf{s}, \mathbf{a}, \mathbf{s}', r)$$ 2. learn model $$\hat{p}(\mathbf{s}' | \mathbf{s}, \mathbf{a})$$ and $$\hat{r}(\mathbf{s}, \mathbf{a})$$ 3. sample $$\mathbf{s} \sim \mathcal{B}$$ from the buffer 4. choose action $$\mathbf{a}$$ (from $$\mathcal{B}$$, $$\pi$$, or at random) 5. simulate $$\mathbf{s}' \sim \hat{p}(\mathbf{s}' | \mathbf{s}, \mathbf{a})$$ and $$r = \hat{r}(\mathbf{s}, \mathbf{a})$$ 6. train on $$(\mathbf{s}, \mathbf{a}, \mathbf{s}', r)$$ with model-free RL 7.
(optional) take $$N$$ more model-based steps ## Local Models Recall that LQR gives a linear feedback controller: ${ p \left( \mathbf { x } _{ t + 1 } | \mathbf { x }_ { t } , \mathbf { u } _{ t } \right) = \mathcal { N } \left( f \left( \mathbf { x }_ { t } , \mathbf { u } _{ t } \right) , \Sigma \right) } \\{ f \left( \mathbf { x }_ { t } , \mathbf { u } _{ t } \right) \approx \mathbf { A }_ { t } \mathbf { x } _{ t } + \mathbf { B }_ { t } \mathbf { u } _{ t } } \\{ \mathbf { A }_ { t } = \frac { d f } { d \mathbf { x } _{ t } } \quad \mathbf { B }_ { t } = \frac { d f } { d \mathbf { u } _{ t } } }$ And $$\mathbf{A}_t, \mathbf{B}_t$$ can be learned directly instead of learning $$f$$. The overall algorithm alternates between running the current controller to collect trajectories, fitting the local dynamics, and improving the controller with iLQR. Choices to update the controller with iLQR output $$\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t, \mathbf{K}_t, \mathbf{k}_t$$: • $$p \left( \mathbf { u } _{ t } | \mathbf { x }_ { t } \right) = \delta(\mathbf { u } _{ t } = \hat{\mathbf { u } }_t)$$ : doesn't correct deviations or drift. • $$p \left( \mathbf { u } _{ t } | \mathbf { x }_ { t } \right) = \delta(\mathbf { u } _{ t } = \mathbf{K}_t (\mathbf{x}_t - \hat{\mathbf{x}}_t) + \mathbf{k}_t + \hat{\mathbf { u } }_t)$$: too strict. • $$p \left( \mathbf { u } _{ t } | \mathbf { x }_ { t } \right) = \mathcal{N}( \mathbf{K}_t (\mathbf{x}_t - \hat{\mathbf{x}}_t) + \mathbf{k}_t + \hat{\mathbf { u } }_t, \Sigma_t)$$ where $$\Sigma_t = \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1}$$: add noise so that samples don't all look the same. Choices to fit the dynamics: • Linear regression: $$p \left( \mathbf { x } _{ t + 1 } | \mathbf { x }_ { t } , \mathbf { u } _{ t } \right) = \mathcal { N } \left( \mathbf { A }_ { t } \mathbf { x } _{ t } + \mathbf { B }_ { t } \mathbf { u } _{ t } + \mathbf { c } , \mathbf { N }_ { t } \right)$$ • Bayesian linear regression: use your favorite global model as the prior. Since we assume the model is locally linear, the updated controller is only good when it is close to the old controller.
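Fitting the locally linear dynamics by linear regression, as in the first choice above, is a per-time-step least-squares problem over the rollout samples at that step. A minimal sketch (the affine ground-truth model and the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 2, 50          # state dim, control dim, number of rollouts

# Samples of (x_t, u_t, x_{t+1}) at one fixed time step t, generated
# here from a known affine model purely for illustration.
A_t = rng.normal(size=(n, n))
B_t = rng.normal(size=(n, m))
c_t = rng.normal(size=n)
X = rng.normal(size=(N, n))
U = rng.normal(size=(N, m))
X_next = X @ A_t.T + U @ B_t.T + c_t

# Linear regression with an intercept recovers [A_t, B_t, c].
Z = np.hstack([X, U, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
A_hat, B_hat, c_hat = W[:n].T, W[n:n + m].T, W[-1]
```

Repeating this independently for each $$t$$ gives the time-varying $$\mathbf{A}_t, \mathbf{B}_t$$ used by iLQR; the residual covariance would give $$\mathbf{N}_t$$.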
Use the trajectory distributions generated by the new and old controllers to measure their closeness, constraining the KL divergence: $$D_{\mathrm { KL }}(p(\tau) \| \bar{p}(\tau)) \le \epsilon$$. ## Global Policy Guided policy search: 1. optimize each local policy $$\pi_{\mathrm{LQR}, i}(\mathbf{u}_t | \mathbf{x}_t)$$ on initial state $$\mathbf{x}_{0, i}$$ w.r.t. $$\tilde{c}_{k, i}(\mathbf{x}_t, \mathbf{u}_t)$$ 2. use samples from step 1 to train $$\pi_\theta(\mathbf{u}_t | \mathbf{x}_t)$$ to mimic each $$\pi_{\mathrm{LQR}, i}(\mathbf{u}_t | \mathbf{x}_t)$$ 3. update cost function $$\tilde{c}_{k+1, i}(\mathbf{x}_t, \mathbf{u}_t) = c(\mathbf{x}_t, \mathbf{u}_t) - \lambda_{k+1, i} \log \pi_\theta(\mathbf{u}_t | \mathbf{x}_t)$$ Distillation: make a single model as good as an ensemble. Train on the ensemble's predictions as soft targets $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$ For multi-task transfer, train an independent policy $$\pi_i$$ for each task, then use supervised learning/distillation: $\mathcal{L} = \sum_{\mathbf{a}} \pi_i(\mathbf{a} | \mathbf{s}) \log \pi_{AMN}(\mathbf{a} | \mathbf{s})$ # Proof ## Proof 1 From the relation between $$Q$$ and $$V$$ we can get $\begin{equation}Q\left(\mathbf{x}_{T-1}, \mathbf{u}_{T-1}\right)=\text { const } + \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T-1} \\\mathbf{u}_{T-1}\end{bmatrix}^{T} \mathbf{C}_{T-1} \begin{bmatrix}\mathbf{x}_{T-1} \\\mathbf{u}_{T-1}\end{bmatrix}+\begin{bmatrix}\mathbf{x}_{T-1} \\\mathbf{u}_{T-1}\end{bmatrix}^{T} \mathbf{c}_{T-1}+V\left(\mathbf{x}_{T}\right)\label{lqrqv}\end{equation}$ At the final step $$T$$, $$V = 0$$ and the same quadratic form holds with index $$T$$. Take the gradient of $$\eqref{lqrqv}$$ w.r.t.
$$\mathbf{u}_T$$: $\begin{equation}\nabla_{\mathbf{u}_{T}} Q\left(\mathbf{x}_{T}, \mathbf{u}_{T}\right)=\mathbf{C}_{\mathbf{u}_{T}, \mathbf{x}_{T}} \mathbf{x}_{T}+\mathbf{C}_{\mathbf{u}_{T}, \mathbf{u}_{T}} \mathbf{u}_{T}+\mathbf{c}_{\mathbf{u}_{T}}\label{lqrg}\end{equation}$ Here we have expanded $$\mathbf{C}_T$$ and $$\mathbf{c}_T$$ to sub-matrices as follows: $\mathbf{C}_{T} =\begin{bmatrix}\mathbf{C}_{\mathbf{x}_{T}, \mathbf{x}_{T}} & \mathbf{C}_{\mathbf{x}_{T}, \mathbf{u}_{T}} \\\mathbf{C}_{\mathbf{u}_{T}, \mathbf{x}_{T}} & \mathbf{C}_{\mathbf{u}_{T}, \mathbf{u}_{T}}\end{bmatrix} \qquad\mathbf{c}_{T} =\begin{bmatrix}\mathbf{c}_{\mathbf{x}_{T}} \\\mathbf{c}_{\mathbf{u}_{T}}\end{bmatrix}$ Set the gradient $$\eqref{lqrg}$$ to $$0$$ and solve for $$\mathbf{u}_T$$: \begin{aligned}\mathbf{u}_{T} &= -\mathbf{C}_{\mathbf{u}_{T}, \mathbf{u}_{T}}^{-1}\left(\mathbf{C}_{\mathbf{u}_{T}, \mathbf{x}_{T}} \mathbf{x}_{T}+\mathbf{c}_{\mathbf{u}_{T}}\right) \\&= \mathbf{K}_T\mathbf{x}_T + \mathbf{k}_T\end{aligned} At step $$T-1$$, first compute $$V(\mathbf{x}_T)$$: \begin{aligned}V(\mathbf{x}_T) &= \text { const } + \frac{1}{2}\begin{bmatrix}\mathbf{x}_{T} \\\mathbf{K}_T\mathbf{x}_T + \mathbf{k}_T\end{bmatrix}^{T} \mathbf{C}_{T} \begin{bmatrix}\mathbf{x}_{T} \\\mathbf{K}_T\mathbf{x}_T + \mathbf{k}_T\end{bmatrix}+\begin{bmatrix}\mathbf{x}_{T} \\\mathbf{K}_T\mathbf{x}_T + \mathbf{k}_T\end{bmatrix}^{T} \mathbf{c}_{T} \\&= \text { const } + \frac{1}{2} \mathbf{x}_T^T \mathbf{V}_T \mathbf{x}_T + \mathbf{x}_T^T \mathbf{v}_{T} \\\mathbf{V}_{T} = &\mathbf{C}_{\mathbf{x}_{T}, \mathbf{x}_{T}}+\mathbf{C}_{\mathbf{x}_{T}, \mathbf{u}_{T}} \mathbf{K}_{T}+\mathbf{K}_{T}^{T} \mathbf{C}_{\mathbf{u}_{T}, \mathbf{x}_{T}}+\mathbf{K}_{T}^{T} \mathbf{C}_{\mathbf{u}_{T}, \mathbf{u}_{T}} \mathbf{K}_{T} \\\mathbf{v}_{T} = &\mathbf{c}_{\mathbf{x}_{T}}+\mathbf{C}_{\mathbf{x}_{T}, \mathbf{u}_{T}} \mathbf{k}_{T}+\mathbf{K}_{T}^{T} \mathbf{c}_{\mathbf{u}_{T}}+\mathbf{K}_{T}^{T} \mathbf{C}_{\mathbf{u}_{T}, \mathbf{u}_{T}}
\mathbf{k}_{T}\end{aligned} Plug the model $$\mathbf{x}_{T}=f\left(\mathbf{x}_{T-1}, \mathbf{u}_{T-1}\right)$$ into $$V$$, and then plug $$V$$ into $$Q$$ to get $Q\left(\mathbf{x}_{T-1}, \mathbf{u}_{T-1}\right)=\text { const }+\frac{1}{2}\begin{bmatrix}{\mathbf{x}_{T-1}} \\{\mathbf{u}_{T-1}}\end{bmatrix}^{T} \mathbf{Q}_{T-1}\begin{bmatrix}{\mathbf{x}_{T-1}} \\{\mathbf{u}_{T-1}}\end{bmatrix}+\begin{bmatrix}{\mathbf{x}_{T-1}} \\{\mathbf{u}_{T-1}}\end{bmatrix}^{T} \mathbf{q}_{T-1} \\\mathbf{Q}_{T-1}=\mathbf{C}_{T-1}+\mathbf{F}_{T-1}^{T} \mathbf{V}_{T} \mathbf{F}_{T-1} \\\mathbf{q}_{T-1}=\mathbf{c}_{T-1}+\mathbf{F}_{T-1}^{T} \mathbf{V}_{T} \mathbf{f}_{T-1}+\mathbf{F}_{T-1}^{T} \mathbf{v}_{T}$ Take the gradient and set it to $$0$$ to get \begin{aligned}\mathbf{u}_{T-1} &= -\mathbf{Q}_{\mathbf{u}_{T-1}, \mathbf{u}_{T-1}}^{-1}\left(\mathbf{Q}_{\mathbf{u}_{T-1}, \mathbf{x}_{T-1}} \mathbf{x}_{T-1}+\mathbf{q}_{\mathbf{u}_{T-1}}\right) \\&= \mathbf{K}_{T-1}\mathbf{x}_{T-1} + \mathbf{k}_{T-1}\end{aligned} Continuing this process backward in time yields the full backward pass. Deep Reinforcement Learning (Part 1) https://silencial.github.io/deep-reinforcement-learning-1/ 2020-02-05T00:00:00.000Z 2020-02-09T00:00:00.000Z Berkeley CS 285 Review. My solution to the homework. Deep Reinforcement Learning (Part 2) · Deep Reinforcement Learning (Part 3) # Imitation Learning ## Behavior Cloning Use supervised learning with training data $$(\mathbf{o}_t, \mathbf{a}_t)$$ to learn the policy $$\pi_\theta(\mathbf{a}_t | \mathbf{o}_t)$$.
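Behavior cloning is ordinary supervised learning on $$(\mathbf{o}_t, \mathbf{a}_t)$$ pairs. A minimal sketch with a synthetic expert and a logistic policy (both are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Expert demonstrations (o_t, a_t): this synthetic expert picks a = 1
# whenever the first observation coordinate is positive.
O = rng.normal(size=(1000, 2))
A = (O[:, 0] > 0).astype(float)

# Fit pi_theta(a=1|o) = sigmoid(o . w + b) by maximum likelihood
# (gradient descent on the negative log-likelihood).
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(O @ w + b)))
    g = p - A                      # gradient of the NLL w.r.t. the logits
    w -= lr * (O.T @ g) / len(A)
    b -= lr * g.mean()

acc = float((((O @ w + b) > 0) == (A > 0.5)).mean())
```

This fits the training distribution well; the failure mode discussed next is what happens once the learned policy visits observations outside that distribution.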
This usually doesn't work, since the distribution mismatch $$p_{\pi_\theta}(\mathbf{o}_t) \ne p_{data}(\mathbf{o}_t)$$ causes errors to accumulate through time. ## DAgger Instead of being clever about $$p_{\pi_\theta}(\mathbf{o}_t) = p_{data}(\mathbf{o}_t)$$, we can use DAgger (Dataset Aggregation) to make $$p_{data}(\mathbf{o}_t) = p_{\pi_\theta}(\mathbf{o}_t)$$: 1. Train $$\pi_\theta(\mathbf{a}_t | \mathbf{o}_t)$$ from human data $$\mathcal{D} = \{\mathbf{o}_1, \mathbf{a}_1, \dots, \mathbf{o}_N, \mathbf{a}_N\}$$ 2. Run $$\pi_\theta(\mathbf{a}_t | \mathbf{o}_t)$$ to get dataset $$\mathcal{D}_\pi = \{\mathbf{o}_1, \dots, \mathbf{o}_M\}$$ 3. Ask human to label $$\mathcal{D}_\pi$$ with action $$\mathbf{a}_t$$ 4. Aggregate: $$\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi$$ ## Problem Non-Markovian behavior: $$\pi_\theta(\mathbf{a}_t | \mathbf{o}_t)$$ vs. $$\pi_\theta(\mathbf{a}_t | \mathbf{o}_1, \dots, \mathbf{o}_t)$$. Can be addressed by using an RNN with LSTM cells Multimodal behavior: The average of two good actions can be a bad action. Can be addressed by • Representing the distribution as a mixture of Gaussians: $$\pi(\mathbf{a}|\mathbf{o}) = \sum_i w_i \mathcal{N}(\mu_i, \Sigma_i)$$ • Latent variable models • Autoregressive discretization ## Cost Function 0-1 cost function: $c(\mathbf{s}, \mathbf{a})=\left\{\begin{array}{l}{0 \text { if } \mathbf{a}=\pi^{\star}(\mathbf{s})} \\ {1 \text { otherwise }}\end{array}\right.$ Assume $$\pi_{\theta}\left(\mathbf{a} \neq \pi^{\star}(\mathbf{s}) | \mathbf{s}\right) \leq \epsilon$$ for all $$\mathbf{s} \in \mathcal{D}_{train}$$, then the expectation of cost is (Proof) $\mathbb{E}\left[ \sum_t c(\mathbf{s}_t, \mathbf{a}_t) \right] \le \sum_t (\epsilon + 2\epsilon t)$ which is $$\mathcal{O}(\epsilon T^2)$$; with DAgger, $$p_{train}(\mathbf{s}) \rightarrow p_\theta(\mathbf{s})$$ and the expectation becomes $$\mathcal{O}(\epsilon T)$$.
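A toy DAgger loop on a 1-D control task, assuming a queryable expert (the task, the linear policy class, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):
    """Queryable expert: always push the state toward the origin."""
    return -np.sign(s)

def learner_action(theta, s):
    return np.clip(theta * s, -1.0, 1.0)

def rollout(theta, s0=3.0, T=20):
    """Run the current learner policy and record the states it visits."""
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        s = s + learner_action(theta, s) + 0.1 * rng.normal()
    return np.array(states)

# DAgger: relabel the states the *learner* visits with expert actions.
D_s, D_a = [], []
theta = 0.0                         # initial learner does nothing
for _ in range(10):
    visited = rollout(theta)                # 2. run the learner
    D_s.extend(visited)                     # 3. expert labels visited states
    D_a.extend(expert(visited))             # 4. aggregate the datasets
    S, A = np.array(D_s), np.array(D_a)
    theta = float(S @ A / (S @ S))          # 1. refit policy a = theta * s
```

The key point is that the training states come from the learner's own rollouts, so $$p_{data}$$ tracks $$p_{\pi_\theta}$$.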
# Introduction to RL ## Markov Decision Process • $$\mathcal{M} = \{ \mathcal{S}, \mathcal{A}, \mathcal{T}, r \}$$ (plus the emission probability $$\mathcal{E}$$ in the partially observed case) • $$\mathcal{S}$$: state space • $$\mathcal{A}$$: action space • $$\mathcal{T}$$: transition operator • $$\mathcal{E}$$: emission probability $$p(\mathbf{o}_t | \mathbf{s}_t)$$ • $$r$$: $$\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$$ reward function Let $$\mu_{t,j} = p(s_t = j)$$, $$\xi_{t,k} = p(a_t = k)$$, $$\mathcal{T}_{i,j,k} = p(s_{t+1} = i | s_t = j, a_t = k)$$, then $\mu_{t, i}=\sum_{j, k} \mathcal{T}_{i, j, k} \mu_{t, j} \xi_{t, k}$ Using the Markov property we can factor the trajectory distribution: $\begin{equation}\underbrace{p_{\theta}\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}\right)}_{p_{\theta}(\tau)}=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)\label{markov}\end{equation}$ ## RL Definition The goal of RL is to find the optimal policy \DeclareMathOperator*{\argmin}{\arg\min}\DeclareMathOperator*{\argmax}{\arg\max}\begin{aligned}\theta^{\star} &= \argmax _{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \\&= \argmax_{\theta} \sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim p_{\theta}(\mathbf{s}_t, \mathbf{a}_t)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]\end{aligned} For the infinite horizon case $$T = \infty$$, we can find the stationary distribution $$\mu = \mathcal{T} \mu$$, where $$\mu = p_{\theta}(\mathbf{s}, \mathbf{a})$$.
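The state-marginal update and the stationary distribution above can be computed directly for small tabular problems; a sketch using power iteration (the random transition tensors and the fixed, state-independent action marginal are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Column-stochastic transition matrix T[i, j] = p(s' = i | s = j)
# for a fixed policy on a small 5-state chain.
T = rng.random((5, 5))
T /= T.sum(axis=0, keepdims=True)

mu = np.full(5, 0.2)              # any initial distribution
for _ in range(1000):
    mu = T @ mu                   # mu_{t+1} = T mu_t -> stationary mu

# With explicit actions: mu_{t+1, i} = sum_{j,k} T3[i, j, k] mu[j] xi[k],
# here with a fixed (state-independent) action marginal xi.
T3 = rng.random((5, 5, 4))
T3 /= T3.sum(axis=0, keepdims=True)
xi = np.full(4, 0.25)
mu2 = np.full(5, 0.2)
for _ in range(1000):
    mu2 = np.einsum('ijk,j,k->i', T3, mu2, xi)
```

For a positive (hence irreducible and aperiodic) transition matrix, power iteration converges to the unique stationary distribution.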
Then the optimal policy can be represented as $\theta^{\star}=\argmax _{\theta} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim p_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \rightarrow \mathbb{E}_{(\mathbf{s}, \mathbf{a}) \sim p_{\theta}(\mathbf{s}, \mathbf{a})}[r(\mathbf{s}, \mathbf{a})]$ ## RL Algorithms ### Policy Gradient Directly differentiate the objective w.r.t. policy ### Value Based Rewrite the RL objective as conditional expectations: $\begin{equation}\begin{split}&\sum_{t=1}^T \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim p_\theta(\mathbf{s}_t, \mathbf{a}_t)} [r(\mathbf{s}_t, \mathbf{a}_t)] \\=&\mathbb{E}_{\mathbf{s}_1\sim p(\mathbf{s}_1)} \biggl[ \mathbb{E}_{\mathbf{a}_1\sim \pi(\mathbf{a}_1|\mathbf{s}_1)} \Bigl[ r(\mathbf{s}_1,\mathbf{a}_1)+\mathbb{E}_{\mathbf{s}_2\sim p(\mathbf{s}_2|\mathbf{s}_1,\mathbf{a}_1)} \bigl[ \mathbb{E}_{\mathbf{a}_2\sim \pi(\mathbf{a}_2|\mathbf{s}_2)}[r(\mathbf{s}_2,\mathbf{a}_2)+\ldots|\mathbf{s}_2]|\mathbf{s}_1,\mathbf{a}_1 \bigr] |\mathbf{s}_1 \Bigr] \biggr]\end{split}\label{exp}\end{equation}$ Define $$Q$$-function as the total reward from taking $$\mathbf{a}_t$$ in $$\mathbf{s}_t$$: $Q^\pi(\mathbf{s}_t,\mathbf{a}_t)=\sum_{t'=t}^T\mathbb{E}_{\pi_\theta}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})|\mathbf{s}_t,\mathbf{a}_t]$ Define Value function as the total reward from $$\mathbf{s}_t$$: \begin{aligned}V^\pi(\mathbf{s}_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})|\mathbf{s}_t] \\&= \mathbb{E}_{\mathbf{a}_t \sim \pi(\mathbf{a}_t | \mathbf{s}_t)} [Q^\pi (\mathbf{s}_t, \mathbf{a}_t)]\end{aligned} then the RL objective $$\eqref{exp}$$ can be represented as $$\mathbb{E}_{\mathbf{s}_1 \sim p(\mathbf{s}_1)} [V^\pi (\mathbf{s}_1)]$$ Estimate value function or $$Q$$-function of the optimal policy (no explicit policy) ### Actor-Critic Estimate value function or $$Q$$-function of the current
policy, and use it to improve the policy ### Model-Based Estimate the transition model, and then 1. Just use the model to plan (no policy) • Trajectory optimization/optimal control (primarily in continuous spaces) • Monte Carlo tree search (discrete spaces) 2. Backpropagate gradients into the policy • Requires some tricks to make it work 3. Use the model to learn a value function • Dynamic programming • Generate simulated experience for a model-free learner (Dyna) ## Evaluation When choosing the algorithm, consider: • Different trade-offs • Sample efficiency • On-policy vs. off-policy: whether improving the policy requires generating new samples • Stability and bias: convergence is not guaranteed, since RL is often not true gradient descent • Different assumptions • Stochastic/deterministic • Continuous/discrete • Episodic/infinite horizon • Whether it is easier to represent the policy or the model # Policy Gradient ## Algorithm Evaluate the objective by samples: $J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta(\tau)}\left[\sum_tr(\mathbf{s}_t,\mathbf{a}_t)\right] \approx \frac{1}{N}\sum_i\sum_tr(\mathbf{s}_{i,t},\mathbf{a}_{i,t})$ Direct policy differentiation: (Proof) $\begin{equation}\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\left[\left(\sum_{t=1}^T\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\right)\left(\sum_{t=1}^Tr(\mathbf{s}_{i,t},\mathbf{a}_{i,t})\right)\right]\label{policy}\end{equation}$ Compared to maximum likelihood in supervised learning: $\nabla_{\theta} J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} | \mathbf{s}_{i, t}\right)\right)$ Policy gradient is like a weighted maximum likelihood objective. So we get our REINFORCE algorithm: 1. Sample $$\{\tau^i\}$$ from $$\pi_\theta(\mathbf{a}_t | \mathbf{s}_t)$$ 2. Compute $$\nabla_\theta J(\theta)$$ 3.
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$ ==Does not require the initial state distribution or the transition probabilities.== ==Can be used in POMDP (partially observed MDP) since the Markov property is not used.== ## Variance One problem with policy gradient is its high variance: adding a constant to the reward function $$r(\tau)$$ will change the update process of the policy. ### Causality Since the policy at time $$t'$$ cannot affect the reward at time $$t$$ when $$t < t'$$, we can change the gradient $$\eqref{policy}$$ to $\begin{equation}\nabla_{\theta} J(\theta) \approx\frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t}) \hat{Q}_{i,t}\label{rewardtogo}\end{equation}$ where $$\hat{Q}_{i,t} = \sum_{t'=t}^Tr(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'})$$ is the reward-to-go. ### Baseline Update the policy by the difference between the current trajectory reward and the average reward. $\begin{equation}\begin{split}&\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta\log \pi_\theta(\tau_i) [r(\tau_i) - b] \\&b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)\end{split}\label{baseline}\end{equation}$ It is unbiased since the expectation of the $$b$$ term w.r.t. the policy is $$0$$. The theoretically best baseline is obtained by taking the gradient of the variance w.r.t. $$b$$ and solving for the optimum: (Proof) $b^*=\frac{\mathbb{E}\left[\big(\nabla_\theta\log \pi_\theta(\tau)\big)^2 r(\tau)\right]}{\mathbb{E}\left[\big(\nabla_\theta\log \pi_\theta(\tau)\big)^2\right]}$ which is the expected reward weighted by gradient magnitudes. ## On/Off Policy Policy gradient is on-policy since it needs re-sampling every time the policy updates, which can be extremely inefficient.
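The reward-to-go and baseline tricks above reduce to simple array operations per trajectory; a sketch (a full REINFORCE step would average these terms over $$N$$ sampled trajectories, and all function names here are illustrative):

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat[t] = sum_{t' >= t} r[t'] (the causality trick)."""
    return np.cumsum(rewards[::-1])[::-1]

def pg_term(grad_log_probs, rewards, baseline=0.0):
    """Policy-gradient contribution of one trajectory:
    sum_t grad log pi(a_t|s_t) * (Q_hat_t - b)."""
    q = reward_to_go(rewards) - baseline
    return (grad_log_probs * q[:, None]).sum(axis=0)

q_hat = reward_to_go(np.array([1.0, 2.0, 3.0]))   # -> [6. 5. 3.]
```

Subtracting `baseline` shifts every weight by the same amount, which changes the variance of the estimator but not its expectation.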
We can avoid re-sampling by using importance sampling: $\mathbb{E}_{x\sim p(x)}[f(x)]=\int p(x)f(x)\mathrm{d}x=\int q(x)\frac{p(x)}{q(x)}f(x)\mathrm{d}x=\mathbb{E}_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]$ then express the objective of a new policy $$\theta'$$ as $J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)} r(\tau) \right]$ and take the gradient: (Proof) $\nabla_{\theta'}J(\theta') = \mathbb{E}_{\tau\sim \pi_\theta(\tau)}\left[\sum_{t=1}^T\nabla_{\theta'}\log \pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t)\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(\mathbf{a}_{t'}|\mathbf{s}_{t'})}{\pi_{\theta}(\mathbf{a}_{t'}|\mathbf{s}_{t'})}\right)\left(\sum_{t'=t}^Tr(\mathbf{s}_{t'},\mathbf{a}_{t'})\right)\right]$ the problem is that the $$\prod$$ term is exponential in $$T$$, which can make the gradient explode or vanish. Rewrite $$J(\theta')$$ as an expectation under the state-action marginal and do importance sampling for both factors to get \begin{aligned}J(\theta') &= \sum_{t=1}^T\mathbb{E}_{\mathbf{s}_t\sim p_\theta(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)}[r(\mathbf{s}_t,\mathbf{a}_t)]\right] \\&= \sum_{t=1}^T\mathbb{E}_{\mathbf{s}_t\sim p_\theta(\mathbf{s}_t)}\left[\frac{p_{\theta'}(\mathbf{s}_t)}{p_{\theta}(\mathbf{s}_t)}\mathbb{E}_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)}\left(\frac{\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t)}{\pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)}r(\mathbf{s}_t,\mathbf{a}_t)\right)\right] \\&\approx \sum_{t=1}^T\mathbb{E}_{\mathbf{s}_t\sim p_\theta(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)}\left(\frac{\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t)}{\pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)}r(\mathbf{s}_t,\mathbf{a}_t)\right)\right]\end{aligned} We are sacrificing some accuracy for efficiency. ## Practice • Using much larger batches helps reduce variance. • Tweaking learning rates is very hard.
Adaptive step-size rules like Adam work okay. # Actor-Critic ## Value Function Fitting If we replace the $$\hat{Q}_{i,t}$$ in $$\eqref{rewardtogo}$$ with its expectation: $Q \left( \mathbf { s }_{ t } , \mathbf { a }_{ t } \right) = \sum_{ t ^ { \prime } = t } ^ { T } \mathbb{E} _{ \pi_ { \theta } } \left[ r ( \mathbf { s } _ { t ^ { \prime } } , \mathbf { a } _ { t ^ { \prime } }) | \mathbf { s } _ { t } , \mathbf { a } _ { t } \right]$ the policy update will have smaller variance: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\left[\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})Q(\mathbf{s}_t,\mathbf{a}_t)\right]$ Use the baseline $$b_t = V(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi_\theta (\mathbf{a}_t | \mathbf{s}_t)} Q(\mathbf{s}_t, \mathbf{a}_t)$$ ==(it is still unbiased, as proved in the homework)== $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\left[\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})A^\pi(\mathbf{s}_{i, t},\mathbf{a}_{i, t})\right]$ where $$A^\pi(\mathbf{s}_t,\mathbf{a}_t)=Q^\pi(\mathbf{s}_t,\mathbf{a}_t)-V^\pi(\mathbf{s}_t)$$ is called the advantage function. Now we have 3 choices of function fitting: $$Q^\pi, V^\pi, A^\pi$$. Since $$V^\pi$$ ==is only a function of the state rather than the state-action pair==, most actor-critic algorithms choose value function fitting.
Also $$Q^\pi, A^\pi$$ can be approximated by $$V^\pi$$: \begin{aligned}Q^\pi(\mathbf{s}_t,\mathbf{a}_t) &= r(\mathbf{s}_t,\mathbf{a}_t) + \sum_{t'=t+1}^T\mathbb{E}_{\pi_\theta}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})|\mathbf{s}_t,\mathbf{a}_t] \\&= r(\mathbf{s}_t,\mathbf{a}_t) + \mathbb{E}_{\mathbf{s}_{t+1}\sim p(\mathbf{s_{t+1}}|\mathbf{s}_t,\mathbf{a}_t)}[V^\pi(\mathbf{s}_{t+1})] \\&\approx r(\mathbf{s}_t,\mathbf{a}_t)+V^\pi(\mathbf{s}_{t+1}) \\\\A^\pi(\mathbf{s}_t,\mathbf{a}_t) &\approx r(\mathbf{s}_t,\mathbf{a}_t)+V^\pi(\mathbf{s}_{t+1})-V^\pi(\mathbf{s}_t)\end{aligned} ## Policy Evaluation Policy evaluation tries to evaluate how good the policy is, e.g. estimating $$Q^\pi$$ or $$V^\pi$$. For example, we can train a $$\hat{V}_\phi^\pi$$ with parameter $$\phi$$ (neural network) to approximate $$V^\pi$$ by performing supervised regression $$\mathcal{L}(\phi)=\frac{1}{2}\sum_i\left\Vert\hat{V}_\phi^\pi(\mathbf{s}_i)-y_i\right\Vert^2$$ on the training data $$\left\{\left(\mathbf{s}_{i,t}, y_{i,t}\right)\right\}$$. How to get $$y_{i,t}$$? ### Monte Carlo \begin{aligned}y_{i,t} &= V^\pi(\mathbf{s}_{i,t})\approx\sum_{t'=t}^Tr(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'}) \\y_{i,t} &= V^\pi(\mathbf{s}_{i,t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^Tr(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'})\end{aligned} The second one is better but requires being able to reset the simulator. ### Ideal Target The ideal choice for $$y_{i,t}$$ is the true expected reward-to-go: $y_{i,t}=\sum_{t'=t}^T\mathbb{E}_{\pi_\theta}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})|\mathbf{s}_{i,t}]\approx r(\mathbf{s}_{i,t},\mathbf{a}_{i,t})+V^\pi(\mathbf{s}_{i,t+1})\approx r(\mathbf{s}_{i,t},\mathbf{a}_{i,t})+\hat{V}_\phi^\pi(\mathbf{s}_{i,t+1})$ We can directly use the previously fitted value function for $$\hat{V}_\phi^\pi(\mathbf{s}_{i,t+1})$$. ==We are trading off some accuracy for smaller variance.== Sometimes referred to as a "bootstrapped" estimate.
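A tabular sketch of policy evaluation with the bootstrapped target, where the "regression" step is exact because $$V$$ is a table rather than a network (the random chain is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, gamma = 6, 0.9

# Markov chain induced by a fixed policy: P[s, s'] = p(s' | s), reward r(s).
P = rng.random((nS, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS)

# Repeatedly regress V onto the bootstrapped target
# y = r + gamma * E_{s'}[V(s')].
V = np.zeros(nS)
for _ in range(500):
    V = r + gamma * P @ V

# The fixed point solves (I - gamma P) V = r exactly.
V_exact = np.linalg.solve(np.eye(nS) - gamma * P, r)
```

With a function approximator the same iteration is run as noisy gradient steps on the squared loss, which is where the bias/variance trade-off enters.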
## Discount Factor For $$T = \infty$$, we have to introduce a discount factor $$\gamma$$ to keep $$\hat{V}_\phi^\pi$$ from becoming infinite. $y_{i,t}\approx r(\mathbf{s}_{i,t},\mathbf{a}_{i,t})+\gamma\hat{V}_\phi^\pi(\mathbf{s}_{i,t+1})$ For the policy gradient $$\eqref{policy}$$, there are two options for introducing $$\gamma$$. One is to discount the reward-to-go: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'})\right)$ the other is to derive from the beginning: \begin{aligned}\nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i=1}^N\left[\left(\sum_{t=1}^T\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\right)\left(\sum_{t=1}^T\gamma^{t-1}r(\mathbf{s}_{i,t},\mathbf{a}_{i,t})\right)\right] \\&= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\gamma^{t-1}\nabla_\theta\log\pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'})\right)\end{aligned} In practice we often use the first option, since the second also discounts the gradient at later time steps and thus focuses more on near-term reward. ## Algorithm Batch actor-critic algorithm: 1. sample $$\{\mathbf{s}_i,\mathbf{a}_i\}$$ from $$\pi_\theta(\mathbf{a}|\mathbf{s})$$ 2. fit $$\hat{V}_\phi^\pi(\mathbf{s})$$ to sample reward sums 3. evaluate $$\hat{A}^\pi(\mathbf{s}_i,\mathbf{a}_i)=r(\mathbf{s}_i,\mathbf{a}_i)+\gamma\hat{V}^\pi_\phi(\mathbf{s}_i')-\hat{V}^\pi_\phi(\mathbf{s}_i)$$ 4. $$\nabla_\theta J(\theta)\approx\sum_{t=1}^T\left[\nabla_\theta\log \pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\hat{A}^\pi(\mathbf{s}_t,\mathbf{a}_t)\right]$$ 5. $$\theta\leftarrow \theta+\alpha\nabla_\theta J(\theta)$$ Online actor-critic algorithm: 1. take action $$\mathbf{a}\sim\pi_\theta(\mathbf{a}|\mathbf{s})$$, get $$(\mathbf{s},\mathbf{a},\mathbf{s}',r)$$ 2. update $$\hat{V}_\phi^\pi(\mathbf{s})$$ using target $$r+\gamma\hat{V}^\pi_\phi(\mathbf{s}')$$ 3.
evaluate $$\hat{A}^\pi(\mathbf{s},\mathbf{a})=r(\mathbf{s},\mathbf{a})+\gamma\hat{V}^\pi_\phi(\mathbf{s}')-\hat{V}^\pi_\phi(\mathbf{s})$$ 4. $$\nabla_\theta J(\theta)\approx\nabla_\theta\log \pi_\theta(\mathbf{a}|\mathbf{s})\hat{A}^\pi(\mathbf{s},\mathbf{a})$$ 5. $$\theta\leftarrow \theta+\alpha\nabla_\theta J(\theta)$$ ## Architecture Design For the actor-critic algorithm, we now have two neural networks to train: $$\mathbf{s} \rightarrow \pi_\theta(\mathbf{a} | \mathbf{s})$$ and $$\mathbf{s} \rightarrow \hat{V}_\phi^\pi(\mathbf{s})$$. • Two separate networks: simple and stable, but inefficient (no shared features). • Shared network: more efficient, but harder to train. For the online algorithm, steps 2 and 4 work best with a batch (reducing variance), so we can parallelize: • Synchronized parallel • Asynchronous parallel ## Improvements ### Critic Baseline Actor-critic has lower variance but is biased; policy gradient is unbiased but has higher variance. We can use $$\hat{V}_\phi^\pi$$ but still keep the estimator unbiased: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\left(\left(\sum_{t'=t}^T\gamma^{t'-t}r(\mathbf{s}_{i,t'},\mathbf{a}_{i,t'})\right)-\hat{V}^\pi_\phi(\mathbf{s}_{i,t})\right)$ ==This is like using a baseline that depends on the state.== $\hat{A}^\pi(\mathbf{s}_t,\mathbf{a}_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(\mathbf{s}_{t'},\mathbf{a}_{t'})-V^\pi_\phi(\mathbf{s}_t)$ We can also use a baseline that depends on the action: $\sum_{t'=t}^\infty \gamma^{t'-t}r(\mathbf{s}_{t'},\mathbf{a}_{t'})-Q^\pi_\phi(\mathbf{s}_t,\mathbf{a}_t)$ but it is unbiased only if the critic is correct.
We can modify the estimator to make it unbiased, provided the second term can be evaluated: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\left(\hat{Q}_{i,t}-Q^\pi_\phi(\mathbf{s}_{i,t},\mathbf{a}_{i,t})\right)+\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\mathbb{E}_{\mathbf{a}\sim\pi_\theta(\mathbf{a}_t|\mathbf{s}_{i,t})}[Q^\pi_\phi(\mathbf{s}_{i,t},\mathbf{a}_t)]$ ### Generalized Advantage Estimation • Critic: low variance, but high bias if the value estimate is wrong • Monte Carlo: no bias, but high variance \begin{aligned}\hat{A}_\text{C}^\pi(\mathbf{s}_t,\mathbf{a}_t) &= r(\mathbf{s}_t,\mathbf{a}_t)+\gamma\hat{V}^\pi_\phi(\mathbf{s}_{t+1})-\hat{V}^\pi_\phi(\mathbf{s}_t) \\\hat{A}_\text{MC}^\pi(\mathbf{s}_t,\mathbf{a}_t) &= \sum_{t'=t}^\infty \gamma^{t'-t}r(\mathbf{s}_{t'},\mathbf{a}_{t'})-\hat{V}^\pi_\phi(\mathbf{s}_t)\end{aligned} We can define the n-step return to interpolate between the two: $\hat{A}_n^\pi(\mathbf{s}_t,\mathbf{a}_t)=\sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(\mathbf{s}_{t'},\mathbf{a}_{t'})+\gamma^n\hat{V}^\pi_\phi(\mathbf{s}_{t+n})-\hat{V}^\pi_\phi(\mathbf{s}_t)$ and the weighted combination of n-step returns, GAE (generalized advantage estimation): \begin{aligned}\hat{A}_\text{GAE}^\pi(\mathbf{s}_t,\mathbf{a}_t) &= \sum_{n=1}^\infty w_n \hat{A}_n^\pi(\mathbf{s}_t, \mathbf{a}_t) \\&= \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})+\gamma\hat{V}_\phi^\pi(\mathbf{s}_{t'+1})-\hat{V}_\phi^\pi(\mathbf{s}_{t'})\right]\end{aligned} where the weights are $$w_n \propto \lambda^{n-1}$$. # Value Function Methods ## Policy Iteration Omit the policy gradient completely and use only the critic: 1. evaluate $$A^\pi(\mathbf{s}, \mathbf{a})$$ 2. set $$\pi'(\mathbf{a}_t|\mathbf{s}_t)=I\left(\mathbf{a}_t=\argmax_{\mathbf{a}_t}A^\pi(\mathbf{s}_t,\mathbf{a}_t)\right)$$ Use dynamic programming to do step 1.
As before, represent $$A^\pi(\mathbf{s}, \mathbf{a})$$ by $$V^\pi(\mathbf{s})$$, since the latter is easier to evaluate. Bootstrapped update: $$V^\pi(\mathbf{s})\leftarrow\mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{s})}[r(\mathbf{s},\mathbf{a})+\gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'|\mathbf{s},\mathbf{a})}[V^\pi(\mathbf{s}')]]$$. With a deterministic policy $$\pi(\mathbf{s}) = \mathbf{a}$$, this simplifies to $$V^\pi(\mathbf{s})\leftarrow r(\mathbf{s},\pi(\mathbf{s}))+\gamma\mathbb{E}_{\mathbf{s}'\sim p(\mathbf{s}'|\mathbf{s},\pi(\mathbf{s}))}[V^\pi(\mathbf{s}')]$$. So we have the policy iteration algorithm: 1. evaluate $$V^\pi(\mathbf{s})$$ by DP 2. set $$\pi'(\mathbf{a}_t|\mathbf{s}_t)=I\left(\mathbf{a}_t=\argmax_{\mathbf{a}_t}A^\pi(\mathbf{s}_t,\mathbf{a}_t)\right)$$ ## Value Iteration We can further simplify the dynamic programming by updating the $$Q$$-function and $$V$$-function iteratively. Notice that $$\argmax_{\mathbf{a}_t}A^\pi(\mathbf{s}_t,\mathbf{a}_t)=\argmax_{\mathbf{a}_t}Q^\pi(\mathbf{s}_t,\mathbf{a}_t)$$ 1. set $$Q(\mathbf{s},\mathbf{a})=r(\mathbf{s},\mathbf{a})+\gamma\mathbb{E}[V(\mathbf{s}')]$$ 2. set $$V(\mathbf{s})=\max_\mathbf{a}Q(\mathbf{s},\mathbf{a})$$ ### Fitted Value Iteration Use a neural network to represent $$V$$, trained with the loss function $\mathcal{L}(\phi)=\frac{1}{2}\left\Vert V_\phi(\mathbf{s})-\max_\mathbf{a}Q^\pi(\mathbf{s},\mathbf{a})\right\Vert^2$ Value iteration can then be rewritten as fitted value iteration: 1. set $$y_i\leftarrow\max_{\mathbf{a}_i}(r(\mathbf{s}_i,\mathbf{a}_i)+\gamma\mathbb{E}[V_\phi(\mathbf{s}_i')])$$ 2. set $$\phi\leftarrow\argmin_\phi\frac{1}{2}\sum_i\left\Vert V_\phi(\mathbf{s}_i)-y_i\right\Vert^2$$ ### Fitted $$Q$$-Iteration Step 1 above requires resetting the simulator to try different actions. Fitted $$Q$$-iteration fixes this problem: 1. collect dataset $$\{(\mathbf{s}_i,\mathbf{a}_i,r_i,\mathbf{s}'_i)\}$$ 2.
set $$y_i\leftarrow r(\mathbf{s}_i,\mathbf{a}_i)+\gamma\mathbb{E}[V_\phi(\mathbf{s}_i')]$$ where $$\mathbb{E}[V_\phi(\mathbf{s}_i')] = \max_{\mathbf{a}'}Q_\phi(\mathbf{s}_i',\mathbf{a}')$$ 3. set $$\phi\leftarrow\argmin_\phi\frac{1}{2}\sum_i\left\Vert Q_\phi(\mathbf{s}_i,\mathbf{a}_i)-y_i\right\Vert^2$$ It is off-policy and uses only one network (no high-variance gradient estimate as in policy gradient), but there are no convergence guarantees with non-linear function approximation. ### Exploration We can make fitted $$Q$$-iteration online: 1. take some action $$\mathbf{a}_i$$ and observe $$(\mathbf{s}_i,\mathbf{a}_i,r_i,\mathbf{s}'_i)$$ 2. set $$y_i\leftarrow r(\mathbf{s}_i,\mathbf{a}_i)+\gamma\max_{\mathbf{a}'}Q_\phi(\mathbf{s}_i',\mathbf{a}')$$ 3. set $$\phi\leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(\mathbf{s}_i,\mathbf{a}_i) (Q_\phi(\mathbf{s}_i,\mathbf{a}_i) - y_i)$$ But it can easily get stuck since we always take the greedy action. We can fix this with epsilon-greedy exploration: $\pi(\mathbf{a}_t | \mathbf{s}_t) = \begin{cases}1 - \epsilon \quad \text{if } \mathbf{a}_t = \argmax_{\mathbf{a}_t} Q_\phi(\mathbf{s}_t, \mathbf{a}_t) \\\epsilon / (|\mathcal{A}| - 1) \quad \text{ otherwise}\end{cases}$ or Boltzmann exploration: $\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) \propto \exp \left(Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$ ## Value Iteration Theory Define the operator $$\mathcal{B}$$: $$\mathcal{B}V=\max_\mathbf{a}r_\mathbf{a}+\gamma\mathcal{T}_\mathbf{a}V$$, where $$\mathcal{T}_\mathbf{a}$$ is the matrix of transitions for action $$\mathbf{a}$$. Then $$V^{\star}(\mathbf{s})=\max _{\mathbf{a}} r(\mathbf{s}, \mathbf{a})+\gamma E\left[V^{\star}\left(\mathbf{s}^{\prime}\right)\right]$$ is a fixed point of $$\mathcal{B}$$, which always exists and is unique. $$\mathcal{B}$$ is a contraction w.r.t. the $$\infty$$-norm: $$\Vert\mathcal{B}V-\mathcal{B}\bar{V}\Vert_\infty\leq\gamma\Vert V-\bar{V}\Vert_\infty$$.
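The contraction and fixed-point claims are easy to check numerically on a small tabular MDP. A minimal numpy sketch (the random MDP below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: T[a] is the row-stochastic transition matrix for action a.
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))  # r[a, s]

def bellman(V):
    """(BV)(s) = max_a [ r(s, a) + gamma * sum_s' T_a(s, s') V(s') ]."""
    return np.max(r + gamma * (T @ V), axis=0)

# Contraction check: ||BV1 - BV2||_inf <= gamma * ||V1 - V2||_inf.
V1, V2 = rng.random(n_states), rng.random(n_states)
lhs = np.max(np.abs(bellman(V1) - bellman(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))

# Value iteration V <- BV converges to the unique fixed point V*.
V = np.zeros(n_states)
for _ in range(1000):
    V = bellman(V)
residual = np.max(np.abs(bellman(V) - V))  # ~ 0 at the fixed point
```

Each application of `bellman` shrinks the distance to the fixed point by at least a factor of $$\gamma$$, so after 1000 iterations the residual is at machine precision.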
Value iteration can be represented by $$V \leftarrow \mathcal{B} V$$, which converges. Define $$\Pi$$: $$\Pi V=\argmin_{V'\in\Omega}\frac{1}{2}\sum\Vert V'(\mathbf{s})-V(\mathbf{s})\Vert^2$$. $$\Pi$$ is a contraction w.r.t. the $$\ell_2$$-norm: $$\Vert\Pi V-\Pi \bar{V}\Vert_2\leq\Vert V-\bar{V}\Vert_2$$. Fitted value iteration can be represented by $$V \leftarrow \Pi \mathcal{B} V$$, but $$\Pi \mathcal{B}$$ is not a contraction of any kind. The same applies to $$Q$$-iteration and fitted $$Q$$-iteration. ==Online $$Q$$-iteration is not true gradient descent, since $$y_i$$ depends on $$\phi$$ but the gradient ignores this dependence, so it need not converge.== The same issue applies to the batch actor-critic algorithm. # $$Q$$-Learning in RL ## Decorrelation The problem with online $$Q$$-iteration is that it uses sequential states, which are strongly correlated, so the neural network overfits to local regions. Parallel workers (synchronized or asynchronous) can help decorrelate the data. Another solution is the replay buffer: since $$Q$$-learning is off-policy, we can sample data from a buffer and perform gradient updates: 1. collect dataset $$\{ (\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i) \}$$ using some policy, add it to buffer $$\mathcal{B}$$ 2. sample a batch $$(\mathbf{s}_j, \mathbf{a}_j, r_j, \mathbf{s}'_j)$$ from $$\mathcal{B}$$ 3. compute $$y_j = r_j + \gamma \max_{\mathbf{a}_j'} Q_{\phi}(\mathbf{s}'_j, \mathbf{a}_j')$$ 4. $$\phi\leftarrow\phi-\alpha\sum_j\frac{\mathrm{d} Q_\phi}{\mathrm{d}\phi}(\mathbf{s}_j,\mathbf{a}_j)\left(Q_\phi(\mathbf{s}_j,\mathbf{a}_j) - y_j \right)$$ ## Deep $$Q$$-Learning The last step above is not stable, since the target value $$y_j$$ changes with every gradient update. We can fix the target value by using a previous network during the gradient updates: 1. update $$\phi' \leftarrow \phi$$ 2. collect dataset $$\{ (\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i) \}$$ using some policy, add it to $$\mathcal{B}$$ 3.
sample a batch $$(\mathbf{s}_j, \mathbf{a}_j, r_j, \mathbf{s}'_j)$$ from $$\mathcal{B}$$ 4. compute $$y_j = r_j + \gamma \max_{\mathbf{a}'_j} Q_{\phi'}(\mathbf{s}'_j, \mathbf{a}'_j)$$ using the target network $$Q_{\phi'}$$ 5. $$\phi\leftarrow\phi-\alpha\sum_j\frac{\mathrm{d} Q_\phi}{\mathrm{d}\phi}(\mathbf{s}_j,\mathbf{a}_j)\left(Q_\phi(\mathbf{s}_j,\mathbf{a}_j) - y_j \right)$$ To make $$\phi'$$ change more smoothly, we can use Polyak averaging: $$\phi' \leftarrow \tau \phi' + (1 - \tau) \phi$$ after every update of $$\phi$$, where $$\tau = 0.999$$ works well in practice. ## Comparison • Online $$Q$$-learning: evicts samples immediately; processes 1, 2, 3 all run at the same speed • DQN: processes 1 and 3 run at the same speed, process 2 is slow • Fitted $$Q$$-iteration: process 3 runs in the inner loop of 2, which runs in the inner loop of 1 ## Double $$Q$$-Learning DQN often overestimates the $$Q$$-value because of the max in the target value $$y_j=r_j+\gamma\max_{\mathbf{a}_j'}Q_{\phi'}(\mathbf{s}_j',\mathbf{a}_j')$$, since $$\mathbb{E}[\max(X_1,X_2)]\geq\max(\mathbb{E}[X_1],\mathbb{E}[X_2])$$.
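The overestimation is easy to see by simulation: take two actions whose true values are both zero and add zero-mean noise to their estimates; the max of the noisy estimates is biased upward even though each estimate is unbiased (a toy check, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions with identical true value 0, estimated with zero-mean noise.
n = 100_000
q_hat = rng.normal(loc=0.0, scale=1.0, size=(n, 2))  # noisy Q estimates

max_of_means = max(q_hat[:, 0].mean(), q_hat[:, 1].mean())  # ~ 0, unbiased
mean_of_max = q_hat.max(axis=1).mean()                      # biased upward
```

For two independent standard normals, $$\mathbb{E}[\max(X_1,X_2)] = 1/\sqrt{\pi} \approx 0.56$$, so `mean_of_max` sits well above `max_of_means`; this is exactly the bias that double $$Q$$-learning targets.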
Double $$Q$$-learning uses two networks to decorrelate the noise in the selected action from the noise in the $$Q$$-value: \begin{aligned}Q_{\phi_A}(\mathbf{s},\mathbf{a}) &\leftarrow r+\gamma Q_{\phi_B}(\mathbf{s}',\argmax_{\mathbf{a}'}Q_{\phi_A}(\mathbf{s}',\mathbf{a}')) \\Q_{\phi_B}(\mathbf{s},\mathbf{a}) &\leftarrow r+\gamma Q_{\phi_A}(\mathbf{s}',\argmax_{\mathbf{a}'}Q_{\phi_B}(\mathbf{s}',\mathbf{a}'))\end{aligned} We can simply reuse the current and target networks in DQN: • Standard $$Q$$-learning: $$y=r+\gamma Q_{\phi'}(\mathbf{s}',\argmax_{\mathbf{a}'}Q_{\phi'}(\mathbf{s}',\mathbf{a}'))$$ • Double $$Q$$-learning: $$y=r+\gamma Q_{\phi'}(\mathbf{s}',\argmax_{\mathbf{a}'}Q_\phi(\mathbf{s}',\mathbf{a}'))$$ ## Multi-Step Return $$Q$$-learning has low variance but is biased if the $$Q$$-value is incorrect, so we can use a multi-step return to make the target more accurate: $y_{j,t}=\sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{j,t'}+\gamma^N\max_{\mathbf{a}_{j,t+N}}Q_{\phi'}(\mathbf{s}_{j,t+N},\mathbf{a}_{j,t+N})$ This target can make learning faster, especially early on. But it is only correct when learning on-policy, since we need $$\mathbf{s}_{j,t'},\mathbf{a}_{j,t'},\mathbf{s}_{j,t'+1}$$ to come from $$\pi$$ for $$t' - t < N - 1$$. Solutions: • ignore the problem (often works very well for small $$N$$) • cut the trace: dynamically choose $$N$$ to use only on-policy data (works well when the action space is small) • importance sampling ## Continuous Actions So far we have assumed $$\mathcal{A}$$ is discrete, so we can compute \begin{aligned}\pi(\mathbf{a}_t|\mathbf{s}_t) &= I\left(\mathbf{a}_t=\argmax_{\mathbf{a}_t}Q_\phi(\mathbf{s}_t,\mathbf{a}_t)\right) \\y_j &= r_j+\gamma\max_{\mathbf{a}_j'}Q_{\phi'}(\mathbf{s}_j',\mathbf{a}_j')\end{aligned} by enumerating all actions. If $$\mathcal{A}$$ is continuous, we can use • gradient-based optimization (e.g.
SGD), which is slow in the inner loop • stochastic optimization, since $$\mathcal{A}$$ is typically low-dimensional • sample actions from some distribution • cross-entropy method (CEM): simple iterative stochastic optimization • CMA-ES • approximate $$Q$$ by an easily optimizable function, e.g. the normalized advantage function (NAF): $$Q_\phi(\mathbf{s},\mathbf{a})=-\frac{1}{2}(\mathbf{a}-\mu_\phi(\mathbf{s}))^\top P_\phi(\mathbf{s})(\mathbf{a}-\mu_\phi(\mathbf{s}))+V_\phi(\mathbf{s})$$ • deep deterministic policy gradient (DDPG): train another network $$\mu_\theta(\mathbf{s})$$ such that $$\mu_\theta(\mathbf{s})\approx\argmax_\mathbf{a}Q_\phi(\mathbf{s},\mathbf{a})$$ ### DDPG 1. take some action $$\mathbf{a}_i$$ and observe $$(\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i)$$, add it to $$\mathcal{B}$$ 2. sample a batch $$(\mathbf{s}_j, \mathbf{a}_j, r_j, \mathbf{s}'_j)$$ from $$\mathcal{B}$$ 3. compute $$y_j = r_j + \gamma Q_{\phi'}(\mathbf{s}'_j, \mu_{\theta'}(\mathbf{s}'_j))$$ using the target networks $$Q_{\phi'}$$ and $$\mu_{\theta'}$$ 4. $$\phi\leftarrow\phi-\alpha\sum_j\frac{\mathrm{d} Q_\phi}{\mathrm{d}\phi}(\mathbf{s}_j,\mathbf{a}_j)\left(Q_\phi(\mathbf{s}_j,\mathbf{a}_j) - y_j \right)$$ 5. $$\theta\leftarrow\theta+\beta\sum_j\frac{\mathrm{d}\mu}{\mathrm{d}\theta} (\mathbf{s}_j) \left.\frac{\mathrm{d}Q}{\mathrm{d}\mathbf{a}} (\mathbf{s}_j,\mathbf{a})\right|_{\mathbf{a}=\mu_\theta(\mathbf{s}_j)}$$ 6. update $$\phi'$$ and $$\theta'$$ (e.g. Polyak averaging) ## Practical Tips • $$Q$$-learning takes some care to stabilize, so test on easy, reliable tasks first.
• Large replay buffers help improve stability • Convergence is slow, and behavior can look random at first • Schedule exploration ($$\epsilon$$ from high to low) and learning rates (high to low); Adam can help • Bellman error gradients can be large; clip gradients or use the Huber loss • Double $$Q$$-learning helps a lot in practice, is simple, and has no downsides • $$N$$-step returns also help a lot, but have some downsides • Test on multiple random seeds; performance can be very inconsistent between runs # Advanced Policy Gradient ## Policy Gradient as Policy Iteration With the objective $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \gamma^t r(\mathbf{s}_t, \mathbf{a}_t) \right]$ we want to maximize $$J(\theta') - J(\theta)$$, where \begin{aligned}J(\theta')-J(\theta) &=J(\theta')-\mathbb{E}_{\mathbf{s}_{0} \sim p(\mathbf{s}_{0})}\left[V^{\pi_{\theta}}(\mathbf{s}_{0})\right] \\&=J(\theta')-\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[V^{\pi_{\theta}}(\mathbf{s}_{0})\right] \\&=J(\theta')-\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} V^{\pi_{\theta}}(\mathbf{s}_{t}) - \sum_{t=1}^{\infty} \gamma^{t} V^{\pi_{\theta}}(\mathbf{s}_{t})\right] \\&=\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} \big( r(\mathbf{s}_{t}, \mathbf{a}_{t})+\gamma V^{\pi_{\theta}}(\mathbf{s}_{t+1})-V^{\pi_{\theta}}(\mathbf{s}_{t}) \big) \right] \\&=\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_{t})\right]\end{aligned} ==The second line holds because the expectation depends only on the initial state distribution, which is the same under either policy.== The advantage is under $$\pi_\theta$$, but the expectation is under $$\pi_{\theta'}$$, so use importance sampling: \begin{aligned}\mathbb{E}_{\tau \sim p_{\theta^{\prime}}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_{t})\right] &= \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta'}(\mathbf{s}_t)} \left[ \mathbb{E}_{\mathbf{a}_t \sim
\pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)} \left[ \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t) \right] \right] \\&= \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta'}(\mathbf{s}_t)} \left[ \mathbb{E}_{\mathbf{a}_t \sim \pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t)} \left[ \frac{\pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t)} \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t) \right] \right]\end{aligned} The $$p_{\theta'}(\mathbf{s}_t)$$ here is still a problem; if we can substitute $$p_\theta(\mathbf{s}_t)$$ for it, we can easily maximize this expression: $J(\theta') - J(\theta) \approx \bar{A}(\theta') \quad \Rightarrow \quad \theta' \leftarrow \argmax_{\theta'} \bar{A}(\theta')$ where $\bar{A}(\theta') = \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta}(\mathbf{s}_t)} \left[ \mathbb{E}_{\mathbf{a}_t \sim \pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t)} \left[ \frac{\pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t)} \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t) \right] \right]$ ### Bounding the Objective The distribution mismatch between $$p_\theta(\mathbf{s}_t)$$ and $$p_{\theta'}(\mathbf{s}_t)$$ can be ignored when $$\pi_\theta$$ and $$\pi_{\theta'}$$ are close. Similar to imitation learning, if $$\pi_{\theta'}$$ is close to $$\pi_\theta$$ in the sense that $$\pi_{\theta'}(\mathbf{a}_t \neq \pi_{\theta}(\mathbf{s}_t) | \mathbf{s}_t) \leq \epsilon$$, then $\left|p_{\theta'}(\mathbf{s}_{t}) - p_{\theta}(\mathbf{s}_{t})\right| \leq 2\epsilon t$ and the objective is bounded by $\mathbb{E}_{\tau \sim p_{\theta^{\prime}}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_{t})\right] \ge \bar{A}(\theta') - \sum_t 2\epsilon t C$ where $$C$$ is a constant of order $$\mathcal{O}(T r_\max)$$, or $$\mathcal{O}(r_\max / (1 - \gamma))$$ with discounting.
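A sample-based estimate of $$\bar{A}(\theta')$$ only needs the log-probabilities of the taken actions under both policies. A minimal numpy sketch with made-up toy arrays (function and variable names are hypothetical); for $$\theta' = \theta$$ the importance weights are all 1:

```python
import numpy as np

def surrogate(logp_new, logp_old, advantages, gammas):
    """Monte Carlo estimate of A-bar(theta'):
    average of (pi_theta' / pi_theta) * gamma^t * A^{pi_theta} over samples."""
    ratio = np.exp(logp_new - logp_old)  # importance weights pi'/pi
    return np.mean(ratio * gammas * advantages)

# Toy data: log-probs of the taken actions under the old and new policies.
logp_old = np.log(np.array([0.5, 0.25, 0.5, 0.25]))
logp_new = np.log(np.array([0.5, 0.25, 0.5, 0.25]))  # identical policies here
adv = np.array([1.0, -1.0, 2.0, 0.5])
gammas = 0.99 ** np.arange(4)

# With theta' = theta the weights are 1, so this reduces to mean(gamma^t * A).
val = surrogate(logp_new, logp_old, adv, gammas)
```

Working with log-probability differences rather than probability ratios directly is the numerically safer choice when the policies are parameterized by a network.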
### KL Divergence Use KL divergence to obtain a more convenient bound: $\left|\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t)-\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\right| \leq \sqrt{\frac{1}{2} D_{\mathrm{KL}}\left(\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t) \| \pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\right)}$ where $D_{\mathrm{KL}}\left(p_1(x) \| p_2(x)\right) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right]$ We can then solve the constrained optimization problem $\begin{equation}\begin{split}&\theta' = \argmax_{\theta'} \bar{A}(\theta') \\\text{s.t.} \quad &D_{\mathrm{KL}}\left(\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t) \| \pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\right) \le \epsilon\end{split}\label{opt}\end{equation}$ ## Algorithms ### Dual Gradient Descent Solve $$\eqref{opt}$$ by first defining the Lagrangian $\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda\left(D_{\mathrm{KL}}\left(\pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t) \| \pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\right) - \epsilon\right)$ then updating in two steps: 1. maximize $$\mathcal{L}(\theta', \lambda)$$ w.r.t. $$\theta'$$ 2.
$$\lambda \leftarrow \lambda + \alpha\left( D_{\mathrm{KL}}\left( \pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t) \| \pi_\theta(\mathbf{a}_t|\mathbf{s}_t) \right) - \epsilon \right)$$ ### Natural Policy Gradient We can solve $$\theta' = \argmax_{\theta'} \bar{A}(\theta')$$ using the first-order Taylor approximation: $\theta' \leftarrow \argmax_{\theta'} \nabla_\theta \bar{A}(\theta)^T (\theta' - \theta)$ and \begin{aligned}\nabla_\theta \bar{A}(\theta) &= \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta}(\mathbf{s}_t)} \left[ \mathbb{E}_{\mathbf{a}_t \sim \pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t)} \left[ \gamma^t \nabla_\theta \log \pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t) A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t) \right] \right] \\&= \nabla_\theta J(\theta)\end{aligned} ==which is exactly the normal policy gradient.== To handle the constraint, first approximate it by a second-order Taylor expansion: $D_{\mathrm{KL}}\left( \pi_{\theta'} \| \pi_\theta \right) \approx \frac{1}{2} (\theta' - \theta)^T \mathbf{F} (\theta' - \theta)$ where $$\mathbf{F}$$ is the Fisher information matrix $\mathbf{F} = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta \nabla_\theta \log \pi_\theta^T \right]$ which can be estimated with samples.
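The sample estimate of $$\mathbf{F}$$ replaces the expectation with an average of outer products of score vectors. A small sketch for a softmax policy over three actions (a toy model invented for illustration), compared against the closed form $$\operatorname{diag}(\pi) - \pi\pi^T$$ that holds for this parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.0])  # logits of a 3-action softmax policy

def probs(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def score(a, th):
    """grad_theta log pi(a) = e_a - pi for a softmax policy."""
    g = -probs(th)
    g[a] += 1.0
    return g

# Empirical Fisher: average outer product of scores over sampled actions.
p = probs(theta)
actions = rng.choice(3, size=50_000, p=p)
F_hat = np.mean([np.outer(score(a, theta), score(a, theta)) for a in actions],
                axis=0)

# Exact Fisher for this policy: diag(pi) - pi pi^T.
F_exact = np.diag(p) - np.outer(p, p)
```

With 50k samples the empirical estimate matches the closed form to a few hundredths; in practice $$\mathbf{F}^{-1}\nabla_\theta J$$ is computed with conjugate gradient rather than an explicit inverse.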
Then $$\eqref{opt}$$ becomes \begin{aligned}&\theta' = \argmax_{\theta'} \nabla_\theta J(\theta)^T (\theta' - \theta) \\\text{s.t.} \quad &\frac{1}{2} (\theta' - \theta)^T \mathbf{F} (\theta' - \theta) \le \epsilon\end{aligned} and the solution is $\theta' = \theta + \sqrt{\frac{2 \epsilon}{\nabla_\theta J(\theta)^T \mathbf{F} \nabla_\theta J(\theta)}} \mathbf{F}^{-1} \nabla_\theta J(\theta)$ # Proof ## Proof 1 With $$\pi_{\theta}\left(\mathbf{a} \neq \pi^{\star}(\mathbf{s}) | \mathbf{s}\right) \leq \epsilon$$, we have $p_{\theta}(\mathbf{s}_{t}) = (1-\epsilon)^{t} p_{\text {train}}\left(\mathbf{s}_{t}\right)+\left(1-(1-\epsilon)^{t}\right) p_{\text {mistake}}(\mathbf{s}_{t}) \\$ then \begin{aligned}\left|p_{\theta}(\mathbf{s}_{t})-p_{\text{train}}(\mathbf{s}_{t})\right| &= \left(1-(1-\epsilon)^{t}\right) \left|p_{\text {mistake}}(\mathbf{s}_{t})-p_{\text {train}}(\mathbf{s}_{t})\right| \\& \leq 2\left(1-(1-\epsilon)^{t}\right) \\& \leq 2\epsilon t\end{aligned} so the cost \begin{aligned}\sum_t \mathbb{E}_{p_\theta(\mathbf{s}_t)}\left[ c_t \right] &= \sum_t\sum_{\mathbf{s}_t} p_\theta(\mathbf{s}_t) c_t(\mathbf{s}_t) \\&\le \sum_t\sum_{\mathbf{s}_t} \Big[ p_{\text{train}}(\mathbf{s}_t) c_t(\mathbf{s}_t) + |p_\theta(\mathbf{s}_t) - p_{\text{train}}(\mathbf{s}_t)| c_{\max} \Big] \\&\le \sum_t (\epsilon + 2\epsilon t)\end{aligned} ## Proof 2 With $$J(\theta) = \int \pi_{\theta}(\tau) r(\tau) d \tau$$, taking the gradient: \begin{aligned}\nabla_{\theta} J(\theta) &= \int \nabla \pi_{\theta}(\tau) r(\tau) d \tau \\&= \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau) d \tau \\&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau)\right]\end{aligned} Expand $$\log \pi_\theta(\tau)$$ to $\log \pi_{\theta}(\tau)=\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T} \log \pi_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)+\log p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ so taking the 
gradient w.r.t. $$\theta$$ leaves only the second term. Plugging in the Markov chain $$\eqref{markov}$$, we get \begin{aligned}\nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^T\nabla_\theta\log \pi_\theta(\mathbf{a}_{t}|\mathbf{s}_{t})\right)\left(\sum_{t=1}^Tr(\mathbf{s}_{t},\mathbf{a}_{t})\right)\right] \\&\approx\frac{1}{N}\sum_{i=1}^N\left[\left(\sum_{t=1}^T\nabla_\theta\log \pi_\theta(\mathbf{a}_{i,t}|\mathbf{s}_{i,t})\right)\left(\sum_{t=1}^Tr(\mathbf{s}_{i,t},\mathbf{a}_{i,t})\right)\right]\end{aligned} ## Proof 3 With $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(\tau) (r(\tau) - b) \right]$ the variance is \begin{aligned}\text{Var}(x) &= \mathbb{E}(x^2) - \mathbb{E}(x)^2 \\&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \big( \nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b) \big)^2 \right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b) \right]^2 \\&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \big( \nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b) \big)^2 \right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(\tau) r(\tau) \right]^2\end{aligned} since subtracting the baseline $$b$$ does not change the expectation. Let $$g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$$ and compute the derivative w.r.t.
$$b$$: \begin{aligned}\frac{d \text{Var}}{d b} &= \frac{d}{d b} \mathbb{E}\left[ g(\tau)^2 (r(\tau) - b)^2 \right] \\&= \frac{d}{d b}\left( \mathbb{E}\left[ g(\tau)^2 r(\tau)^2 \right] - 2 \mathbb{E}\left[ g(\tau)^2 r(\tau) b \right] + b^2 \mathbb{E}\left[ g(\tau)^2 \right] \right) \\&= -2 \mathbb{E}\left[ g(\tau)^2 r(\tau) \right] + 2 b \mathbb{E}\left[ g(\tau)^2 \right]\end{aligned} Setting this to zero gives the optimal $$b$$: $b=\frac{\mathbb{E}\left[g(\tau)^2 r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$ ## Proof 4 \begin{aligned}\nabla_{\theta'}J(\theta') &= \mathbb{E}_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\nabla_{\theta'}\log \pi_{\theta'}(\tau)r(\tau)\right] \\&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \left( \prod_{t=1}^T \frac{\pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\pi_\theta(\mathbf{a}_t | \mathbf{s}_t)} \right) \left( \sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]\end{aligned} By causality (future actions do not affect the current importance weight), we can rewrite this as $\nabla_{\theta'}J(\theta')= \mathbb{E}_{\tau\sim \pi_\theta(\tau)}\left[\sum_{t=1}^T\nabla_{\theta'}\log \pi_{\theta'}(\mathbf{a}_t|\mathbf{s}_t) \left(\prod_{t'=1}^t\frac{\pi_{\theta'}(\mathbf{a}_{t'}|\mathbf{s}_{t'})}{\pi_{\theta}(\mathbf{a}_{t'}|\mathbf{s}_{t'})}\right) \left(\sum_{t'=t}^T r(\mathbf{s}_{t'},\mathbf{a}_{t'}) \left(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(\mathbf{a}_{t''}|\mathbf{s}_{t''})}{\pi_{\theta}(\mathbf{a}_{t''}|\mathbf{s}_{t''})}\right) \right)\right]$ ==Ignore the second $$\prod$$ and we can get the
policy iteration algorithm.== Berkeley [CS 285](http://rail.eecs.berkeley.edu/deeprlcourse/) Review. My [solution](https://github.com/silencial/DeepRL) to the homework. See also [Deep Reinforcement Learning (Part 2)](https://silencial.github.io/deep-reinforcement-learning-2/) and [Deep Reinforcement Learning (Part 3)](https://silencial.github.io/deep-reinforcement-learning-3/). # Nonlinear System ME 583 Review. Based on the book Nonlinear Systems by Hassan K. Khalil. # Introduction A dynamical system can be represented by a finite number of coupled ODEs: $\dot{x} = f(t, x, u) \\y = h(t, x, u)$ When $$f$$ does not depend explicitly on $$u$$, the state equation becomes: $\dot{x} = f(t, x)$ Furthermore, the system is said to be autonomous or time-invariant if $$f$$ does not depend explicitly on $$t$$: $\dot{x} = f(x)$ Compared to linear systems, nonlinear systems exhibit some unique phenomena: 1. Finite escape time: • Linear system: the state can go to infinity only as $$t \to \infty$$. • Nonlinear system: the state can escape to infinity in finite time. 2. Multiple isolated equilibria: • Linear system: can have only one isolated equi. point. • Nonlinear system: the state may converge to one of several steady-state operating points, depending on the initial state of the system. 3. Limit cycles: • Linear system: must have a pair of eigenvalues on the imaginary axis to oscillate, which is nonrobust (destroyed by perturbations). • Nonlinear system: can go into an oscillation of fixed amplitude and frequency, irrespective of the initial state. 4. Subharmonic, harmonic, or almost-periodic oscillations: • Linear system: produces an output of the same frequency under a periodic input. • Nonlinear system: can oscillate with frequencies that are submultiples or multiples of the input frequency, and may even generate an almost-periodic oscillation. 5.
Chaos: • Linear system: deterministic steady-state behavior. • Nonlinear system: can have more complicated steady-state behavior that is not an equilibrium, a periodic oscillation, or an almost-periodic oscillation. 6. Multiple modes of behavior: a nonlinear system may exhibit multiple modes of behavior depending on the type of excitation, and when a property of the excitation changes smoothly, the behavior mode can jump discontinuously. # Second-Order System Consider a second-order autonomous system: $\begin{equation}\begin{split}\dot{x}_1 = f_1(x_1, x_2) \\\dot{x}_2 = f_2(x_1, x_2)\end{split}\label{second}\end{equation}$ ## Qualitative Behavior of Linear Systems For a second-order LTI system, $$\eqref{second}$$ becomes: $\dot{x} = Ax$ and the solution given $$x(0) = x_0$$ is $x(t) = M e^{J t} M^{-1} x_0$ where $$J$$ is the Jordan form of $$A$$ and $$M$$ is a real nonsingular matrix s.t. $$M^{-1} A M = J$$. $$J$$ can take three forms depending on the eigenvalues of $$A$$. ### Case 1 $$\lambda_1 \ne \lambda_2$$, both real and nonzero, $$J = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$$ The change of coordinates $$z = M^{-1} x$$ transforms the system into two decoupled first-order DEs: $\dot{z}_1 = \lambda_1 z_1 \\\dot{z}_2 = \lambda_2 z_2$ 1. $$\lambda_1 < 0, \lambda_2 < 0$$: the equi. point $$x=0$$ is stable (a stable node); in the $$z_1$$-$$z_2$$ plane all trajectories of the phase portrait converge to the origin. 2. $$\lambda_1 > 0, \lambda_2 > 0$$: $$x=0$$ is unstable (an unstable node); reversing the arrow directions of the stable-node phase portrait gives this case. 3. $$\lambda_1 > 0 > \lambda_2$$: $$x=0$$ is a saddle point.
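Cases 1–4 amount to classifying the origin by the eigenvalues of $$A$$; this can be packaged as a small helper (a sketch — the function name is made up):

```python
import numpy as np

def classify_equilibrium(A, tol=1e-9):
    """Classify x = 0 for xdot = A x (2x2 A) by the eigenvalues of A."""
    l1, l2 = np.linalg.eigvals(A)
    if abs(l1.imag) > tol:                 # Case 3: complex pair alpha +/- j*beta
        alpha = l1.real
        if alpha < -tol:
            return "stable focus"
        if alpha > tol:
            return "unstable focus"
        return "center"
    r1, r2 = sorted([l1.real, l2.real])    # real eigenvalues
    if abs(r1) < tol or abs(r2) < tol:     # Case 4: zero eigenvalue
        return "equilibrium subspace"
    if r2 < 0:
        return "stable node"
    if r1 > 0:
        return "unstable node"
    return "saddle point"                  # r1 < 0 < r2
```

For example, `A = [[-1, 0], [0, -2]]` gives a stable node, `[[1, 0], [0, -1]]` a saddle, and `[[0, -1], [1, 0]]` a center (purely imaginary pair). Repeated eigenvalues (Case 2) fall into the node branches here; distinguishing $$k=0$$ from $$k=1$$ would need the eigenvectors.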
### Case 2 $$\lambda_1 = \lambda_2 \in \mathbb{R}$$, $$J = \begin{bmatrix} \lambda_1 & k \\ 0 & \lambda_2 \end{bmatrix}$$, where $$k$$ is either 0 or 1 (phase portraits for $$k=0$$ and $$k=1$$ omitted). ### Case 3 $$\lambda_{1,2} = \alpha \pm j \beta$$, $$J = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix}$$ (phase portraits for $$\alpha < 0$$, $$\alpha > 0$$, and $$\alpha = 0$$ omitted). $$x=0$$ is referred to as a stable focus if $$\alpha < 0$$, an unstable focus if $$\alpha > 0$$, and a center if $$\alpha = 0$$. ### Case 4 $$\lambda_1 \lambda_2 = 0$$ $$A$$ has a nontrivial null space and the system has an equilibrium subspace. ## Periodic Orbits Consider the second-order autonomous system $\begin{equation}\dot{x} = f(x)\label{sys}\end{equation}$ where $$f(x)$$ is cont. diff. Poincare-Bendixson Criterion: Consider system $$\eqref{sys}$$ and let $$M$$ be a closed bounded subset of the plane s.t. • $$M$$ contains no equi. points, or contains only one equi. point s.t. the Jacobian matrix $$\partial f / \partial x$$ at this point has eigenvalues with positive real parts. • Every trajectory starting in $$M$$ stays in $$M$$ for all future time. Then, $$M$$ contains a periodic orbit of $$\eqref{sys}$$. Bendixson Criterion: If, on a simply connected region $$D$$ of the plane, $$\nabla \cdot f$$ is not identically zero and does not change sign, then system $$\eqref{sys}$$ has no periodic orbits lying entirely in $$D$$. # Fundamental Properties ## Definition • Connected set: a set that cannot be partitioned into two nonempty open sets • Compact set: closed and bounded • Domain: an open and connected set • Locally Lipschitz (LL) on a domain $$D \subset \mathbb{R}^n$$ if each point of $$D$$ has a neighborhood $$D_0$$ s.t. $$f$$ satisfies the Lipschitz condition for all points in $$D_0$$ with some Lipschitz const. $$L_0$$ • Globally Lipschitz (GL): Lipschitz on $$\mathbb{R}^n$$ with a uniform Lipschitz const.
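The LL/GL distinction can be seen with $$f(x) = x^2$$: since $$|f'(x)| = 2|x| \le 2r$$ on any ball $$|x| \le r$$, the constant $$L = 2r$$ works locally, but no single constant works on all of $$\mathbb{R}$$. A toy numeric check (not from the notes):

```python
import numpy as np

def lipschitz_ratio(f, x, y):
    """|f(x) - f(y)| / |x - y|: a lower bound on any valid Lipschitz constant."""
    return abs(f(x) - f(y)) / abs(x - y)

f = lambda x: x ** 2  # f'(x) = 2x is unbounded on R

# On the ball |x| <= r the difference quotients stay below L = 2r.
r = 10.0
xs = np.linspace(-r, r, 1001)
ratios = [lipschitz_ratio(f, a, b) for a, b in zip(xs[:-1], xs[1:])]
local_ok = max(ratios) <= 2 * r + 1e-9

# Globally, the difference quotient grows without bound: at (x, x + 1) it is 2x + 1.
global_ratio = lipschitz_ratio(f, 1e6, 1e6 + 1.0)  # far exceeds any fixed L
```

So $$f(x) = x^2$$ is LL on $$\mathbb{R}$$ but not GL, consistent with the uniform-boundedness condition on $$\partial f / \partial x$$ in the lemmas that follow.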
## Existence and Uniqueness Theorem 1 (Local Existence and Uniqueness): Let $$f(t, x)$$ be piecewise cont. in $$t$$ and satisfy the Lipschitz condition $\| f(t, x) - f(t, y) \| \le L \| x - y \|$ $$\forall x, y\in B = \{ x\in \mathbb{R}^n \mid \| x - x_0 \| \le r \}$$, $$\forall t \in [t_0, t_1]$$. Then there exists some $$\delta > 0$$ s.t. the state equation $$\dot{x} = f(t, x)$$ with $$x(t_0) = x_0$$ has a unique solution over $$[t_0, t_0 + \delta]$$. We have some lemmas below to prove Lipschitz condition by $$\partial f / \partial x$$. Lemma 1: Let $$f : [a, b] \times D \rightarrow \mathbb{R}^m$$ be cont. on $$D\subseteq \mathbb{R}^n$$. Suppose that $$[\partial f / \partial x]$$ exists and is cont. on $$[a,b] \times D$$. If, for a convex set $$W \subseteq D$$, there is a const. $$L \ge 0$$ s.t. $\left\| \frac { \partial f } { \partial x } ( t , x ) \right\| \leq L$ on $$[a,b] \times W$$, then $\left\| f ( t , x ) - f ( t , y ) \right\| \leq L \| x - y \|$ for all $$t \in [a,b]$$, $$x\in W$$, and $$y\in W$$. Lemma 2: If $$f(t,x)$$ and $$[\partial f / \partial x](t,x)$$ are cont. on $$[a,b]\times D$$, for $$D\in \mathbb{R}^n$$, then $$f$$ is LL in $$x$$ on $$[a,b] \times D$$. Lemma 3: If $$f(t,x)$$ and $$[\partial f / \partial x](t,x)$$ are cont. on $$[a,b] \times \mathbb{R}^n$$, then $$f$$ is GL in $$x$$ on $$[a,b]\times \mathbb{R}^n$$ iff $$[\partial f / \partial x]$$ is uniformly bounded (UB) on $$[a,b]\times \mathbb{R}^n$$. Theorem 2 (Global Existence and Uniqueness): Let $$f(t,x)$$ be piecewise cont. in $$t$$ and satisfy $\| f(t, x) - f(t, y) \| \le L \| x - y \|$ $$\forall x, y \in \mathbb{R}^n$$, $$\forall t \in [t_0, t_1]$$. Then, the state equation $$\dot{x} = f(t, x)$$, with $$x(t_0) = x_0$$, has a unique solution over $$[t_0, t_1]$$ Theorem 3: Global existence and uniqueness theorem that requires $$f$$ to be only LL: Let $$f(t,x)$$ be piecewise cont. in $$t$$ and LL in $$x$$ for all $$t \ge t_0$$ and all $$x$$ in $$D\subset \mathbb{R}^n$$. 
Let $$W$$ be a compact subset of $$D$$, $$x_0 \in W$$, and suppose every solution of $\begin{equation}\dot{x} = f(t,x) \qquad x(t_0) = x_0\label{init}\end{equation}$ lies entirely in $$W$$. Then, there is a unique solution that is defined for all $$t \ge t_0$$. ## Continuous Dependence on Initial Conditions and Parameters The solution of $$\eqref{init}$$ must depend cont. on the initial state $$x_0$$, the initial time $$t_0$$, and the right-hand side function $$f(t, x)$$. Theorem 4: Let $$f(t, x)$$ be piecewise cont. in $$t$$ and Lipschitz in $$x$$ on $$[t_0, t_1] \times W$$ with a Lipschitz const. $$L$$, where $$W \subset \mathbb{R}^n$$ is an open connected set. Let $$y(t)$$ and $$z(t)$$ be the solutions of $\dot{y} = f(t, y), \qquad y(t_0) = y_0 \\\dot{z} = f(t, z) + g(t, z), \qquad z(t_0) = z_0$ s.t. $$y(t), z(t) \in W$$ for all $$t \in [t_0, t_1]$$. Suppose that $\|g(t, x)\| \le \mu,\quad \forall (t, x) \in [t_0, t_1] \times W$ for some $$\mu >0$$. Then $\| y ( t ) - z ( t ) \| \leq \left\| y _{ 0 } - z_ { 0 } \right\| e^{L \left( t - t _{ 0 } \right)} + \frac { \mu } { L } \left( e^{ L \left( t - t_ { 0 } \right) } - 1 \right)$ $$\forall t \in [t_0, t_1]$$. The next theorem shows the continuity of solutions in terms of initial states and parameters. Theorem 5: Let $$f(t, x, \lambda)$$ be cont. in $$(t, x, \lambda)$$ and LL in $$x$$ (uniformly in $$t$$ and $$\lambda$$) on $$[t_0, t_1] \times D \times \{ \|\lambda - \lambda_0 \| \le c \}$$, where $$D \subset \mathbb{R}^n$$ is an open connected set. Let $$y(t, \lambda_0)$$ be a solution of $$\dot{x} = f(t, x, \lambda_0)$$ with $$y(t_0, \lambda_0) = y_0 \in D$$. Suppose $$y(t, \lambda_0)$$ is defined and belongs to $$D$$ for all $$t \in [t_0, t_1]$$. Then, given $$\epsilon > 0$$, there is $$\delta >0$$ s.t.
if $\| z_0 - y_0 \| < \delta, \qquad \| \lambda - \lambda_0 \| < \delta$ then there is a unique solution $$z(t, \lambda)$$ of $$\dot{x} = f(t, x, \lambda)$$ defined on $$[t_0, t_1]$$, with $$z(t_0, \lambda) = z_0$$, and $$z(t, \lambda)$$ satisfies $\| z(t, \lambda) - y(t, \lambda_0) \| < \epsilon, \quad \forall t\in [t_0, t_1]$ ## Sensitivity Equations Suppose that $$f(t, x, \lambda)$$ is cont. in $$(t, x, \lambda)$$ and has cont. first partial derivatives w.r.t. $$x$$ and $$\lambda$$ for all $$(t, x, \lambda) \in [t_0, t_1] \times \mathbb{R}^n \times \mathbb{R}^p$$. Let $$\lambda_0$$ be a nominal value of $$\lambda$$, and suppose that the nominal state equation $\dot{x} = f(t,x,\lambda_0), \qquad x(t_0) = x_0$ has a unique solution $$x(t, \lambda_0)$$ over $$[t_0, t_1]$$. We know that for all $$\lambda$$ sufficiently close to $$\lambda_0$$, the state equation $\dot{x} = f(t,x,\lambda), \qquad x(t_0) = x_0$ has a unique solution $$x(t, \lambda)$$ over $$[t_0, t_1]$$ that is close to the nominal solution $$x(t, \lambda_0)$$. The cont. diff. of $$f$$ w.r.t. $$x$$ and $$\lambda$$ implies the additional property that the solution $$x(t, \lambda)$$ is diff. w.r.t. $$\lambda$$ near $$\lambda_0$$. Writing the solution in integral form, $x ( t , \lambda ) = x _{ 0 } + \int_ { t _{ 0 } } ^ { t } f ( s , x ( s , \lambda ) , \lambda ) d s$ taking partial derivatives w.r.t. $$\lambda$$ yields $x_ { \lambda } ( t , \lambda ) = \int _{ t_ { 0 } } ^ { t } \left[ \frac { \partial f } { \partial x } ( s , x ( s , \lambda ) , \lambda ) x _ { \lambda } ( s , \lambda ) + \frac { \partial f } { \partial \lambda } ( s , x ( s , \lambda ) , \lambda ) \right] d s$ where $$x_\lambda (t_0, \lambda) = 0$$. Differentiating w.r.t.
$$t$$ yields \begin{aligned}\frac{\partial}{\partial t} x_\lambda (t, \lambda) &= \left.\frac{\partial f(t, x, \lambda)}{\partial x} \right|_{x = x(t, \lambda)} x_\lambda (t, \lambda) + \left.\frac{\partial f(t, x, \lambda)}{\partial \lambda} \right|_{x = x(t, \lambda)} \\&= A(t, \lambda) x_\lambda (t, \lambda) + B(t, \lambda)\end{aligned} For $$\lambda$$ sufficiently close to $$\lambda_0$$, the matrices $$A(t, \lambda)$$ and $$B(t, \lambda)$$ are defined on $$[t_0, t_1]$$. Hence, $$x_\lambda(t, \lambda)$$ is defined on the same interval. Let $$S(t) = x_\lambda(t, \lambda_0)$$; then $$S(t)$$ is the unique solution of the equation $\begin{equation}\dot{S}(t) = A(t, \lambda_0) S(t) + B(t, \lambda_0), \qquad S(t_0) = 0\label{sens}\end{equation}$ $$S(t)$$ is called the sensitivity function, and $$\eqref{sens}$$ is called the sensitivity equation. # Lyapunov Stability ## Autonomous System Consider the autonomous system $\begin{equation}\dot{x} = f(x)\label{as}\end{equation}$ where $$f:D\rightarrow \mathbb{R}^n$$ is a LL map from $$D\subset \mathbb{R}^n$$ into $$\mathbb{R}^n$$. $$\bar{x}$$ is an equilibrium point of the system if $$f(\bar{x}) = 0$$. Without loss of generality, we can assume the equi. point is at the origin. Definition 1: The equi. point $$x=0$$ of $$\eqref{as}$$ is • stable if, for each $$\epsilon >0$$, there is $$\delta > 0$$ s.t. $\| x(0) \| < \delta \Rightarrow \| x(t) \| < \epsilon, \quad\forall t \ge 0$ • unstable if it is not stable • asymptotically stable (AS) if it is stable and $$\delta$$ can be chosen s.t. $\| x(0) \| < \delta \Rightarrow \lim_{t\to \infty} x(t) = 0$ Theorem 1: Let $$x=0$$ be an equi. point of $$\eqref{as}$$ and $$D\subset \mathbb{R}^n$$ be a domain containing $$x=0$$. Let $$V : D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. \begin{aligned}V(0) &= 0 \\V(x) &> 0 \quad \forall x \in D -\{0\} \\\dot{V}(x) &\le 0 \quad \forall x \in D\end{aligned} which is called a Lyapunov function. Then, $$x=0$$ is stable.
Moreover, if $\dot{V}(x) < 0 \quad \forall x \in D - \{0\}$ then $$x=0$$ is AS. Theorem 2: Let $$x=0$$ be an equi. point of $$\eqref{as}$$. Let $$V : \mathbb{R}^n \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $V(0) = 0 \\V(x) > 0 \quad \forall x \ne 0 \\\| x \| \rightarrow \infty \Rightarrow V(x) \rightarrow \infty \\\dot{V}(x) < 0 \quad \forall x \ne 0$ then $$x=0$$ is globally asymptotically stable (GAS). A function $$V(x)$$ satisfying the condition $$V(x) \rightarrow \infty$$ as $$\| x \| \rightarrow \infty$$ is said to be radially unbounded (RU). If the origin is GAS, then it must be the unique equi. point. Theorem 3: Let $$x=0$$ be an equi. point of $$\eqref{as}$$. Let $$V: D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $$V(0) = 0$$ and $$V(x_0) > 0$$ for some $$x_0$$ with arbitrarily small $$\| x_0 \|$$. Define a set $$U=\{x \in B_r \mid V(x) > 0\}$$ and suppose $$\dot{V}(x) > 0$$ in $$U$$. Then $$x=0$$ is unstable. ## Invariance Principle Definition 2: A point $$p$$ is said to be a positive limit point of $$x(t)$$ if there is a sequence $$\{ t_n \}$$, with $$t_n \to \infty$$ as $$n\to \infty$$, s.t. $$x(t_n) \to p$$ as $$n \to \infty$$. Definition 3: A set $$M$$ is said to be an invariant set w.r.t. $$\eqref{as}$$ if $x(0) \in M \Rightarrow x(t) \in M, \quad \forall t \in \mathbb{R}$ It is a positively invariant (PI) set if $x(0) \in M \Rightarrow x(t) \in M, \quad \forall t \ge 0$ Lemma 1: If a solution $$x(t)$$ of $$\eqref{as}$$ is bounded and belongs to $$D$$ for $$t \ge 0$$, then its positive limit set $$L^+$$ is a nonempty, compact, invariant set. Moreover, $$x(t)$$ approaches $$L^+$$ as $$t \to \infty$$. Theorem 4: Let $$\Omega \subset D$$ be a compact set that is PI w.r.t. $$\eqref{as}$$. Let $$V : D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $$\dot{V}(x) \le 0$$ in $$\Omega$$. Let $$E=\{x \in \Omega \mid \dot{V}(x)=0 \}$$. Let $$M$$ be the largest invariant set in $$E$$.
Then every solution starting in $$\Omega$$ approaches $$M$$ as $$t \to \infty$$. Corollary 1: Let $$x=0$$ be an equi. point of $$\eqref{as}$$. Let $$V : D \rightarrow \mathbb{R}$$ be a cont. diff. PD function on a domain $$D$$ containing the origin, s.t. $$\dot{V}(x) \le 0$$ in $$D$$. Let $$S = \{ x \in D \mid \dot{V}(x) = 0 \}$$ and suppose that no solution can stay identically in $$S$$, other than the trivial solution $$x(t) \equiv 0$$. Then the origin is AS. Corollary 2: Let $$x=0$$ be an equi. point of $$\eqref{as}$$. Let $$V : \mathbb{R}^n \rightarrow \mathbb{R}$$ be a cont. diff., RU, PD function s.t. $$\dot{V}(x) \le 0$$ for all $$x \in \mathbb{R}^n$$. Let $$S = \{ x \in \mathbb{R}^n \mid \dot{V}(x) = 0 \}$$ and suppose that no solution can stay identically in $$S$$, other than the trivial solution $$x(t) \equiv 0$$. Then the origin is GAS. ## LTI Systems and Linearization Theorem 5: The equi. point $$x = 0$$ of $$\dot{x} = Ax$$ is stable iff all eigenvalues of $$A$$ satisfy $$\operatorname{Re}\lambda_i \le 0$$ and for every eigenvalue with $$\operatorname{Re}\lambda_i = 0$$ and algebraic multiplicity $$q_i \ge 2$$, $$\operatorname{rank}(A - \lambda_i I) = n - q_i$$, where $$n$$ is the dimension of $$x$$. The equi. point $$x=0$$ is (globally) AS iff all eigenvalues of $$A$$ satisfy $$\operatorname{Re}\lambda_i < 0$$. Theorem 6: A matrix $$A$$ is Hurwitz ($$\operatorname{Re}\lambda_i < 0$$ for all eigenvalues) iff for any given PD symmetric matrix $$Q$$ there exists a PD symmetric matrix $$P$$ that satisfies the Lyapunov equation: $PA + A^T P = -Q$ Moreover, if $$A$$ is Hurwitz, then $$P$$ is the unique solution. Theorem 7 (Lyapunov's indirect method): Let $$x=0$$ be an equi. point for $$\eqref{as}$$, where $$f:D\rightarrow \mathbb{R}^n$$ is cont. diff. and $$D$$ is a neighborhood of the origin. Let $$A = \left. \frac{\partial f}{\partial x} \right|_{x=0}$$, then • The origin is AS if $$\operatorname{Re}\lambda_i < 0$$ for all eigenvalues of $$A$$.
• The origin is unstable if $$\operatorname{Re}\lambda_i > 0$$ for at least one eigenvalue of $$A$$. ## Comparison Functions Definition 4: A cont. function $$\alpha: [0, a) \rightarrow [0, \infty)$$ is said to belong to $$\mathcal{K}$$ if it is strictly increasing and $$\alpha(0) = 0$$. It is said to belong to $$\mathcal{K}_\infty$$ if $$a = \infty$$ and $$\alpha(r) \rightarrow \infty$$ as $$r \rightarrow \infty$$. Definition 5: A cont. function $$\beta: [0, a) \times [0, \infty) \rightarrow [0, \infty)$$ is said to belong to $$\mathcal{KL}$$ if, for each fixed $$s$$, the mapping $$\beta(r, s)$$ belongs to $$\mathcal{K}$$ w.r.t. $$r$$ and, for each fixed $$r$$, the mapping $$\beta(r, s)$$ is decreasing w.r.t. $$s$$ and $$\beta(r, s) \rightarrow 0$$ as $$s \rightarrow \infty$$. Lemma 2: Let $$\alpha_1$$ and $$\alpha_2$$ be $$\mathcal{K}$$ functions on $$[0, a)$$, $$\alpha_3$$ and $$\alpha_4$$ be $$\mathcal{K}_\infty$$ functions, and $$\beta$$ be a $$\mathcal{KL}$$ function. Denote the inverse of $$\alpha_i$$ by $$\alpha_i^{-1}$$. Then • $$\alpha_1^{-1}$$ is defined on $$[0, \alpha_1(a))$$ and belongs to $$\mathcal{K}$$ • $$\alpha_3^{-1}$$ is defined on $$[0, \infty)$$ and belongs to $$\mathcal{K}_\infty$$ • $$\alpha_1 \circ \alpha_2$$ belongs to $$\mathcal{K}$$ • $$\alpha_3 \circ \alpha_4$$ belongs to $$\mathcal{K}_\infty$$ • $$\sigma(r, s) = \alpha_1( \beta(\alpha_2(r), s))$$ belongs to $$\mathcal{KL}$$ Lemma 3: Let $$V : D \rightarrow \mathbb{R}$$ be a cont. PD function defined on a domain $$D \subset \mathbb{R}^n$$ that contains the origin. Let $$B_r \subset D$$ for some $$r > 0$$. Then, there exist $$\mathcal{K}$$ functions $$\alpha_1$$ and $$\alpha_2$$, defined on $$[0, r]$$, s.t. $\alpha_1(\|x\|) \le V(x) \le \alpha_2(\|x\|)$ for all $$x\in B_r$$. If $$D = \mathbb{R}^n$$, the functions $$\alpha_1$$ and $$\alpha_2$$ will be defined on $$[0, \infty)$$ and the foregoing inequality will hold for all $$x\in \mathbb{R}^n$$. 
Moreover, if $$V(x)$$ is RU, then $$\alpha_1$$ and $$\alpha_2$$ can be chosen to belong to $$\mathcal{K}_\infty$$. Lemma 4: Consider the scalar autonomous DE $\dot{y} = -\alpha(y), \qquad y(t_0) = y_0$ where $$\alpha$$ is a LL $$\mathcal{K}$$ function defined on $$[0,a)$$. For all $$0 \le y_0 < a$$, this equation has a unique solution $$y(t)$$ defined for all $$t \ge t_0$$. Moreover, $y(t) = \sigma(y_0, t - t_0)$ where $$\sigma$$ is a $$\mathcal{KL}$$ function defined on $$[0, a) \times [0, \infty)$$. ## Nonautonomous System Consider the nonautonomous system $\begin{equation}\dot{x} = f(t, x)\label{nas}\end{equation}$ where $$f: [0, \infty) \times D \rightarrow \mathbb{R}^n$$ is piecewise cont. in $$t$$ and LL in $$x$$ on $$[0, \infty) \times D$$, and $$D \subset \mathbb{R}^n$$ is a domain that contains the origin. The origin is an equi. point at $$t=0$$ if $f(t, 0) = 0, \quad \forall t \ge 0$ Definition 6: The equi. point $$x=0$$ of $$\eqref{nas}$$ is • stable if, for each $$\epsilon >0$$, there is $$\delta = \delta(\epsilon, t_0) > 0$$ s.t. $\begin{equation}\left\| x \left( t_ { 0 } \right) \right\| < \delta \Rightarrow \| x ( t ) \| < \varepsilon , \quad \forall t \geq t _ { 0 } \ge 0\label{nastable}\end{equation}$ • uniformly stable (US) if, for each $$\epsilon >0$$, there is $$\delta = \delta(\epsilon) > 0$$, independent of $$t_0$$, s.t. $$\eqref{nastable}$$ is satisfied. • unstable if it is not stable • AS if it is stable and there is a const. $$c=c(t_0) > 0$$ s.t. $\begin{equation}\| x(t_0) \| < c \Rightarrow \lim_{t\to \infty} x(t) = 0\label{naas}\end{equation}$ • uniformly asymptotically stable (UAS) if it is US and there exists $$c>0$$, independent of $$t_0$$, s.t. $$\eqref{naas}$$ is satisfied. • globally uniformly asymptotically stable (GUAS) if it is US, $$\delta(\epsilon)$$ can be chosen to satisfy $$\lim_{\epsilon \to \infty} \delta(\epsilon) = \infty$$, and, for each pair of $$\eta > 0$$ and $$c > 0$$, there is $$T = T(\eta, c) > 0$$ s.t.
$\| x ( t ) \| < \eta , \quad \forall t \geq t_ { 0 } + T ( \eta , c ) , \quad \forall \left\| x \left( t _ { 0 } \right) \right\| < c$ Definition 7: The equi. point $$x=0$$ of $$\eqref{nas}$$ is • US iff there exist a $$\mathcal{K}$$ function $$\alpha$$ and a const. $$c>0$$, independent of $$t_0$$, s.t. $\| x ( t ) \| \leq \alpha \left( \left\| x \left( t_ { 0 } \right) \right\| \right) , \quad \forall t \geq t _{ 0 } \geq 0 ,\ \forall \left\| x \left( t_ { 0 } \right) \right\| < c$ • UAS iff there exist a $$\mathcal{KL}$$ function $$\beta$$ and a const. $$c>0$$, independent of $$t_0$$, s.t. $\begin{equation}\| x ( t ) \| \leq \beta \left( \left\| x \left( t_ { 0 } \right) \right\| , t - t _{ 0 } \right) , \quad \forall t \geq t_ { 0 } \geq 0 ,\ \forall \left\| x \left( t _ { 0 } \right) \right\| < c\label{uaskl}\end{equation}$ • GUAS iff $$\eqref{uaskl}$$ is satisfied for any initial state $$x(t_0)$$ Definition 8: The equi. point $$x=0$$ of $$\eqref{nas}$$ is exponentially stable (ES) if there exist const. $$c > 0$$, $$k > 0$$ and $$\lambda > 0$$ s.t. $\| x ( t ) \| \leq k \left\| x \left( t _{ 0 } \right) \right\| e ^ { - \lambda \left( t - t_ { 0 } \right) } , \quad \forall \left\| x \left( t _{ 0 } \right) \right\| < c$ and globally exponentially stable (GES) if it holds for any initial state $$x(t_0)$$ Theorem 8: Let $$x=0$$ be an equi. point of $$\eqref{nas}$$ and $$D\subset \mathbb{R}^n$$ be a domain containing $$x=0$$. Let $$V : [0, \infty) \times D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $W_1(x) \le V(t, x) \le W_2(x) \\\frac{\partial V}{\partial t} + \frac{\partial V}{\partial x}f(t, x) \le 0$ $$\forall t \ge 0$$ and $$\forall x \in D$$, where $$W_1(x)$$ and $$W_2(x)$$ are cont. PD functions on $$D$$. Then, $$x=0$$ is US. 
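To see these conditions in action, take the scalar system $$\dot{x} = -(1 + \sin^2 t)\, x$$ with $$V(x) = x^2/2$$: then $$\dot{V} = -(1+\sin^2 t)\, x^2 \le -x^2 = -2V$$, so Theorem 8 applies with $$W_1 = W_2 = V$$, and the comparison bound $$V(t) \le V(t_0) e^{-2(t-t_0)}$$ gives $$|x(t)| \le |x(0)| e^{-t}$$. A minimal numerical sketch (the system, step size, and RK4 integrator are illustrative choices, not from the notes):

```python
import math

def f(t, x):
    # time-varying vector field: x' = -(1 + sin^2 t) x
    return -(1.0 + math.sin(t) ** 2) * x

def rk4_step(t, x, h):
    # one classical Runge-Kutta step
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(x0, t0=0.0, T=5.0, h=1e-3):
    t, x = t0, x0
    while t < T - 1e-12:
        x = rk4_step(t, x, h)
        t += h
    return x

x0 = 1.0
xT = simulate(x0)
# V' <= -2V along trajectories gives |x(t)| <= |x0| e^{-(t - t0)}
assert abs(xT) <= abs(x0) * math.exp(-5.0) + 1e-9
```

The decay observed numerically is even faster than $$e^{-t}$$, consistent with the conservative nature of the Lyapunov bound.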
Theorem 9: Suppose the assumptions of Theorem 8 are satisfied with the inequality strengthened to $\frac{\partial V}{\partial t} + \frac{\partial V}{\partial x}f(t, x) \le -W_3(x)$ $$\forall t \ge 0$$ and $$\forall x \in D$$, where $$W_3(x)$$ is a cont. PD function on $$D$$. Then, $$x=0$$ is UAS. Moreover, if $$r$$ and $$c$$ are chosen s.t. $$B_r = \{ \|x\| \le r \} \subset D$$ and $$c < \min_{\|x\| = r} W_1(x)$$, then every trajectory starting in $$\{ x \in B_r \mid W_2(x) \le c\}$$ satisfies $\| x ( t ) \| \leq \beta \left( \left\| x \left( t_ { 0 } \right) \right\| , t - t _{ 0 } \right) , \quad \forall t \geq t_ { 0 } \geq 0$ for some class $$\mathcal{KL}$$ function $$\beta$$. Finally, if $$D = \mathbb{R}^n$$ and $$W_1(x)$$ is RU, then $$x=0$$ is GUAS. Theorem 10: Let $$x=0$$ be an equi. point of $$\eqref{nas}$$ and $$D\subset \mathbb{R}^n$$ be a domain containing $$x=0$$. Let $$V : [0, \infty) \times D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $\begin{equation}{ k _{ 1 } \| x \| ^ { a } \leq V ( t , x ) \leq k_ { 2 } \| x \| ^ { a } } \\{ \frac { \partial V } { \partial t } + \frac { \partial V } { \partial x } f ( t , x ) \leq - k _{ 3 } \| x \| ^ { a } }\label{es}\end{equation}$ $$\forall t \ge 0$$ and $$\forall x \in D$$, where $$k_1, k_2, k_3, a$$ are positive const. Then $$x=0$$ is ES. If the assumptions hold globally, then $$x=0$$ is GES. ## LTV Systems and Linearization The stability of the LTV system $\begin{equation}\dot{x}(t) = A(t) x\label{ltv}\end{equation}$ can be completely characterized in terms of the state transition matrix $x(t) = \Phi(t, t_0) x(t_0)$ Theorem 11: The equi. point $$x=0$$ of $$\eqref{ltv}$$ is (globally) UAS iff the state transition matrix satisfies $\left\| \Phi \left( t , t _{ 0 } \right) \right\| \leq k e ^ { - \lambda \left( t - t_ { 0 } \right) } , \quad \forall t \geq t _ { 0 } \geq 0$ for some positive const. $$k$$ and $$\lambda$$. Theorem 12: Let $$x=0$$ be the ES equi. point of $$\eqref{ltv}$$.
Suppose $$A(t)$$ is cont. and bounded. Let $$Q(t)$$ be a cont., bounded, PD, symmetric matrix. Then, there is a cont. diff., bounded, PD, symmetric matrix $$P(t)$$ that satisfies $-\dot { P } ( t ) = P ( t ) A ( t ) + A ^ { T } ( t ) P ( t ) + Q ( t )$ and $$V(t,x) = x^T P(t) x$$ is a Lyapunov function for the system that satisfies $$\eqref{es}$$. Theorem 13: Let $$x=0$$ be an equi. point for $$\eqref{nas}$$, where $$f: [0, \infty) \times D \rightarrow \mathbb{R}^n$$ is cont. diff., $$D = \{ x \in \mathbb{R}^n \mid \|x\|_2 < r \}$$, and the Jacobian matrix $$[\partial f / \partial x]$$ is bounded and Lipschitz on $$D$$, uniformly in $$t$$. Let $A(t) = \left. \frac{\partial f}{\partial x} (t, x) \right|_{x=0}$ Then, the origin is ES for the nonlinear system iff it is ES for the linear system $$\dot{x} = A(t) x$$. ## Converse Theorems Theorem 14: Let $$x=0$$ be an equi. point for $$\eqref{nas}$$, where $$f: [0, \infty) \times D \rightarrow \mathbb{R}^n$$ is cont. diff., $$D = \{ x \in \mathbb{R}^n \mid \|x\| < r \}$$, and the Jacobian matrix $$[\partial f / \partial x]$$ is bounded on $$D$$, uniformly in $$t$$. Let $$k, \lambda, r_0$$ be positive const. with $$r_0 < r/ k$$. Let $$D_0 = \{ x \in \mathbb{R}^n \mid \|x\| < r_0 \}$$. Assume that the trajectories of the system satisfy $\| x ( t ) \| \leq k \left\| x \left( t _{ 0 } \right) \right\| e ^ { - \lambda \left( t - t_ { 0 } \right) } , \quad \forall x \left( t _{ 0 } \right) \in D_ { 0 } , \ \forall t \geq t _{ 0 } \geq 0$ Then, there is a function $$V: [0, \infty) \times D_0 \rightarrow \mathbb{R}$$ that satisfies the inequalities $c _{ 1 } \| x \| ^ { 2 } \leq V ( t , x ) \leq c_ { 2 } \| x \| ^ { 2 } \\\frac { \partial V } { \partial t } + \frac { \partial V } { \partial x } f ( t , x ) \leq - c _{ 3 } \| x \| ^ { 2 } \\\left\| \frac { \partial V } { \partial x } \right\| \leq c_ { 4 } \| x \|$ for some positive const. $$c_1, c_2, c_3, c_4$$. Moreover, if $$r=\infty$$ and the origin is GES, then $$V(t,x)$$ is defined and satisfies the above inequalities on $$\mathbb{R}^n$$.
Furthermore, if the system is autonomous, $$V$$ can be chosen independent of $$t$$. Theorem 15: Let $$x=0$$ be an AS equi. point for $$\eqref{nas}$$ where $$f: [0, \infty) \times D \rightarrow \mathbb{R}^n$$ is cont. diff., $$D = \{ x \in \mathbb{R}^n \mid \|x\| < r \}$$, and the Jacobian matrix $$[\partial f / \partial x]$$ is bounded on $$D$$, uniformly in $$t$$. Let $$\beta$$ be a $$\mathcal{KL}$$ function and $$r_0$$ be a positive const. s.t. $$\beta(r_0, 0) < r$$. Let $$D_0 = \{x\in \mathbb{R}^n \mid \|x\| < r_0 \}$$. Assume that the trajectory of the system satisfies $\| x ( t ) \| \leq \beta \left( \left\| x \left( t _{ 0 } \right) \right\| , t - t_ { 0 } \right) , \quad \forall x \left( t _{ 0 } \right) \in D_ { 0 } ,\ \forall t \geq t _{ 0 } \geq 0$ Then, there is a cont. diff. function $$V: [0, \infty) \times D_0 \rightarrow \mathbb{R}$$ that satisfies $\alpha _{ 1 } ( \| x \| ) \leq V ( t , x ) \leq \alpha_ { 2 } ( \| x \| ) \\\frac { \partial V } { \partial t } + \frac { \partial V } { \partial x } f ( t , x ) \leq - \alpha _{ 3 } ( \| x \| ) \\\left\| \frac { \partial V } { \partial x } \right\| \leq \alpha_ { 4 } ( \| x \| )$ where $$\alpha_{1,2,3,4}$$ are $$\mathcal{K}$$ functions defined on $$[0, r_0]$$. If the system is autonomous, $$V$$ can be chosen independent of $$t$$. Theorem 16: Let $$x=0$$ be an AS equi. point for $$\eqref{as}$$ where $$f: D \rightarrow \mathbb{R}^n$$ is LL and $$D \subset \mathbb{R}^n$$ is a domain containing the origin. Let $$R_A \subset D$$ be the region of attraction of $$x=0$$. Then there is a smooth, PD function $$V(x)$$ and a cont. PD function $$W(x)$$, both defined for all $$x\in R_A$$, s.t. $\lim_{x \rightarrow \partial R_ { A }} V ( x ) \rightarrow \infty \\\frac { \partial V } { \partial x } f ( x ) \leq - W ( x ) , \quad \forall x \in R _{ A }$ and for any $$c>0$$, $$\{V(x) \le c\}$$ is a compact subset of $$R_A$$. When $$R_A = \mathbb{R}^n$$, $$V(x)$$ is RU.
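For LTI systems the converse construction is explicit: solving $$PA + A^T P = -Q$$ (Theorem 6) gives $$V(x) = x^T P x$$, which satisfies the quadratic bounds of Theorem 14 with $$a = 2$$. A short numpy sketch using the vectorization identity $$\operatorname{vec}(PA + A^T P) = (A^T \otimes I + I \otimes A^T)\operatorname{vec}(P)$$; the matrices $$A$$ and $$Q$$ below are arbitrary illustrative choices:

```python
import numpy as np

def solve_lyapunov(A, Q):
    """Solve P A + A^T P = -Q via vectorization (column-major vec)."""
    n = A.shape[0]
    I = np.eye(n)
    M = np.kron(A.T, I) + np.kron(I, A.T)
    p = np.linalg.solve(M, -Q.flatten(order="F"))
    return p.reshape((n, n), order="F")

A = np.array([[0.0, 1.0], [-2.0, -3.0]])  # Hurwitz: eigenvalues -1, -2
Q = np.eye(2)
P = solve_lyapunov(A, Q)

assert np.allclose(P @ A + A.T @ P, -Q)   # Lyapunov equation holds
assert np.all(np.linalg.eigvalsh(P) > 0)  # P is positive definite
```

Since $$A$$ is Hurwitz, no pair of eigenvalues sums to zero, so the vectorized linear system is nonsingular and the (symmetric, PD) solution is unique, matching Theorem 6.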
## Boundedness Definition 9: The solutions of $$\eqref{nas}$$ are • uniformly bounded (UB) if there exists a positive const. $$c$$, independent of $$t_0 \ge 0$$, and for every $$a \in (0, c)$$, there is $$\beta = \beta(a) > 0$$, independent of $$t_0$$, s.t. $$\left\| x \left( t _{ 0 } \right) \right\| \leq a \Rightarrow \| x ( t ) \| \leq \beta , \quad \forall t \geq t_ { 0 }$$ • globally uniformly bounded (GUB) if UB holds for $$c=\infty$$ • uniformly ultimately bounded (UUB) with ultimate bound $$b$$ if there exist positive const. $$b$$ and $$c$$, independent of $$t_0 \ge 0$$, and for every $$a\in (0, c)$$, there is $$T = T(a, b) \ge 0$$, independent of $$t_0$$, s.t. $$\left\| x \left( t _{ 0 } \right) \right\| \leq a \Rightarrow \| x ( t ) \| \leq b , \quad \forall t \geq t_ { 0 } + T$$ • globally uniformly ultimately bounded (GUUB) if UUB holds for $$c=\infty$$ Theorem 17: Let $$D \subset \mathbb{R}^n$$ be a domain containing $$x=0$$ and $$V : [0, \infty) \times D \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. $\alpha _{ 1 } ( \| x \| ) \leq V ( t , x ) \leq \alpha_ { 2 } ( \| x \| ) \\\frac { \partial V } { \partial t } + \frac { \partial V } { \partial x } f ( t , x ) \leq - W _{ 3 } ( x ) , \quad \forall \|x\| \ge \mu > 0$ $$\forall t \ge 0$$ and $$\forall x \in D$$, where $$\alpha_{1,2}$$ are $$\mathcal{K}$$ functions and $$W_3(x)$$ is a cont. PD function. Take $$r > 0$$ s.t. $$B_r \subset D$$ and suppose that $$\mu < \alpha_2^{-1}(\alpha_1(r))$$. Then, there exists a $$\mathcal{KL}$$ function $$\beta$$ and, for every $$x(t_0)$$ satisfying $$\|x(t_0)\| \le \alpha_2^{-1}(\alpha_1(r))$$, there is $$T \ge 0$$ (dependent on $$x(t_0)$$ and $$\mu$$) s.t.
the solution of $$\eqref{nas}$$ satisfies $\| x ( t ) \| \leq \beta \left( \left\| x \left( t_ { 0 } \right) \right\| , t - t _{ 0 } \right) , \forall t_ { 0 } \leq t \leq t _{ 0 } + T \\\| x ( t ) \| \leq \alpha_ { 1 } ^ { - 1 } \left( \alpha _{ 2 } ( \mu ) \right) , \forall t \geq t_ { 0 } + T$ Moreover, if $$D = \mathbb{R}^n$$ and $$\alpha_1$$ belongs to $$\mathcal{K}_\infty$$, then it holds $$\forall x(t_0)$$ ## Input-to-State Stability Consider the system $\begin{equation}\dot{x} = f(t, x, u)\label{nasinput}\end{equation}$ where $$f: [0, \infty) \times \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^n$$ is piecewise cont. in $$t$$ and LL in $$x$$ and $$u$$. The input $$u(t)$$ is a piecewise cont., bounded function of $$t$$ for all $$t \ge 0$$. Suppose the unforced system $$\dot{x} = f(t, x, 0)$$ has a GUAS equi. point at the origin. What can we say about the system $$\eqref{nasinput}$$ in the presence of a bounded input $$u(t)$$? For the LTI system: $\dot{x} = Ax + Bu$ with a Hurwitz matrix $$A$$, we can write the solution as $x(t)=e^{\left(t-t_{0}\right) A} x\left(t_{0}\right)+\int_{t_{0}}^{t} e^{(t-\tau) A} B u(\tau) d \tau$ and use the bound $$\left\|e^{\left(t-t_{0}\right) A}\right\| \leq k e^{-\lambda\left(t-t_{0}\right)}$$ to estimate the solution by \begin{aligned}\|x(t)\| & \leq k e^{-\lambda\left(t-t_{0}\right)}\left\|x\left(t_{0}\right)\right\|+\int_{t_{0}}^{t} k e^{-\lambda(t-\tau)}\|B\|\|u(\tau)\| d \tau \\& \leq k e^{-\lambda\left(t-t_{0}\right)}\left\|x\left(t_{0}\right)\right\|+\frac{k\|B\|}{\lambda} \sup _{t_{0} \leq \tau \leq t}\|u(\tau)\|\end{aligned} This estimate shows that the zero-input response decays to zero exponentially fast, while the zero-state response is bounded for every bounded input (BIBS). Definition 10: The system $$\eqref{nasinput}$$ is said to be input-to-state stable (ISS) if there exist a $$\mathcal{KL}$$ function $$\beta$$ and a $$\mathcal{K}$$ function $$\gamma$$ s.t.
for any initial state $$x(t_0)$$ and any bounded input $$u(t)$$, the solution $$x(t)$$ exists for all $$t \ge t_0$$ and satisfies $\|x(t)\| \leq \beta\left(\left\|x\left(t_{0}\right)\right\|, t-t_{0}\right)+\gamma\left(\sup _{t_{0} \leq \tau \leq t}\|u(\tau)\|\right)$ Theorem 18: Let $$V : [0, \infty) \times \mathbb{R}^n \rightarrow \mathbb{R}$$ be a cont. diff. function s.t. \begin{aligned}\alpha_{1}(\|x\|) & \leq V(t, x) \leq \alpha_{2}(\|x\|) \\\frac{\partial V}{\partial t}+\frac{\partial V}{\partial x} f(t, x, u) &\leq-W_{3}(x), \quad \forall\|x\| \geq \rho(\|u\|)>0\end{aligned} $$\forall (t,x,u) \in [0, \infty) \times \mathbb{R}^n \times \mathbb{R}^m$$, where $$\alpha_{1,2}$$ are $$\mathcal{K}_\infty$$ functions, $$\rho$$ is a $$\mathcal{K}$$ function, and $$W_3(x)$$ is a cont. PD function on $$\mathbb{R}^n$$. Then, the system $$\eqref{nasinput}$$ is ISS with $$\gamma = \alpha_1^{-1} \circ \alpha_2 \circ \rho$$ Lemma 5: Suppose $$f(t,x,u)$$ is cont. diff. and GL in $$(x,u)$$, uniformly in $$t$$. If the unforced system $$\dot{x} = f(t, x, 0)$$ has a GES equi. point at the origin, then the system $$\eqref{nasinput}$$ is ISS. Consider the cascade system: \begin{align}\dot{x}_{1}&=f_{1}\left(t, x_{1}, x_{2}\right) \label{cascade1} \\\dot{x}_{2}&=f_{2}\left(t, x_{2}\right) \label{cascade2}\end{align} where $$f_1 : [0, \infty) \times \mathbb{R}^{n_1} \times \mathbb{R}^{n_2} \rightarrow \mathbb{R}^{n_1}$$ and $$f_2 : [0, \infty) \times \mathbb{R}^{n_2} \rightarrow \mathbb{R}^{n_2}$$ are piecewise cont. in $$t$$ and LL in $$x$$. Suppose both $$\dot{x}_1 = f_1(t, x_1, 0)$$ and $$\dot{x}_{2}=f_{2}\left(t, x_{2}\right)$$ have a GUAS equi. point at their respective origins. Lemma 6: If the system $$\eqref{cascade1}$$, with $$x_2$$ as input, is ISS and the origin of $$\eqref{cascade2}$$ is GUAS, then the origin of the cascade system is GUAS.
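The LTI estimate above is easy to check numerically: for a diagonal Hurwitz $$A$$ we may take $$k = 1$$ and $$\lambda = \min_i |\operatorname{Re}\lambda_i|$$, and every trajectory should stay below $$k e^{-\lambda (t - t_0)} \|x(t_0)\| + (k \|B\| / \lambda) \sup_\tau \|u(\tau)\|$$. A sketch with an illustrative system and input (not from the notes):

```python
import math

# x' = A x + B u with diagonal Hurwitz A, so ||e^{At}|| <= e^{-t} (k = 1, lam = 1)
A = [[-1.0, 0.0], [0.0, -2.0]]
B = [1.0, 1.0]
k, lam = 1.0, 1.0
normB = math.sqrt(sum(b * b for b in B))

def u(t):
    return math.sin(t)  # bounded input, sup |u| = 1

def f(t, x):
    return [A[i][0] * x[0] + A[i][1] * x[1] + B[i] * u(t) for i in range(2)]

def rk4_step(t, x, h):
    k1 = f(t, x)
    k2 = f(t + h / 2, [x[i] + h / 2 * k1[i] for i in range(2)])
    k3 = f(t + h / 2, [x[i] + h / 2 * k2[i] for i in range(2)])
    k4 = f(t + h, [x[i] + h * k3[i] for i in range(2)])
    return [x[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(2)]

x = [2.0, -1.0]
x0_norm = math.sqrt(sum(v * v for v in x))
t, h = 0.0, 1e-3
while t < 20.0:
    x = rk4_step(t, x, h)
    t += h
    # ISS-type bound: decaying zero-input term plus bounded zero-state term
    bound = k * math.exp(-lam * t) * x0_norm + k * normB / lam  # sup |u| = 1
    assert math.sqrt(sum(v * v for v in x)) <= bound + 1e-9
```

After the transient dies out, the trajectory norm settles well under the constant term $$k\|B\|/\lambda$$, as the estimate predicts.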
# Input-Output Stability ## $$\mathcal{L}$$ Stability Consider a system with input-output relation represented by $y = H u$ where $$H$$ is a mapping from $$u$$ to $$y$$, $$u : [0, \infty) \rightarrow \mathbb{R}^m$$. Define the space $$\mathcal{L}_p^m$$ for $$1\le p \le \infty$$ as the set of all piecewise cont. functions $$u : [0, \infty) \rightarrow \mathbb{R}^m$$ s.t. $\|u\|_{\mathcal{L}_{p}}=\left(\int_{0}^{\infty}\|u(t)\|^{p} d t\right)^{1 / p}<\infty$ Specifically, for $$p=2$$ and $$p=\infty$$, the spaces are defined respectively as \begin{aligned}\|u\|_{\mathcal{L}_{2}} &= \sqrt{\int_{0}^{\infty} u^{T}(t) u(t) d t}<\infty \\\| u \|_ { \mathcal { L } _{ \infty } } &= \sup_ { t \geq 0 } \| u ( t ) \| < \infty\end{aligned} Define the extended space $$\mathcal{L}_e^m$$ as $\mathcal{L}_{e}^{m}=\left\{u | u_{\tau} \in \mathcal{L}^{m}, \forall \tau \in[0, \infty)\right\}$ where $$u_\tau$$ is a truncation of $$u$$ defined by $u_{\tau}(t)=\left\{\begin{array}{cc}{u(t),} & {0 \leq t \leq \tau} \\ {0,} & {t>\tau}\end{array}\right.$ Definition 1: A mapping $$H : \mathcal{L}_e^m \rightarrow \mathcal{L}_e^q$$ is $$\mathcal{L}$$ stable if there exist a $$\mathcal{K}$$ function $$\alpha$$, defined on $$[0, \infty)$$, and a nonnegative constant $$\beta$$ s.t. $\left\|(H u)_{\tau}\right\|_{\mathcal{L}} \leq \alpha\left(\left\|u_{\tau}\right\|_{\mathcal{L}}\right)+\beta$ for all $$u \in \mathcal{L}_e^m$$ and $$\tau \in [0, \infty)$$. It is finite-gain $$\mathcal{L}$$ stable if there exist nonnegative const. $$\gamma$$ and $$\beta$$ s.t. $\left\|(H u)_{\tau}\right\|_{\mathcal{L}} \leq \gamma \left\|u_{\tau}\right\|_{\mathcal{L}} + \beta$ for all $$u \in \mathcal{L}_e^m$$ and $$\tau \in [0, \infty)$$. Note that the definition of $$\mathcal{L}_\infty$$ stability is the same as BIBO stability. Definition 2: A mapping $$H : \mathcal{L}_e^m \rightarrow \mathcal{L}_e^q$$ is small-signal $$\mathcal{L}$$ stable (small-signal finite-gain $$\mathcal{L}$$ stable) if there is a positive const.
$$r$$ s.t. the inequality in Definition 1 is satisfied for all $$u \in \mathcal{L}_e^m$$ with $$\sup_{0 \le t \le \tau} \|u(t)\| \le r$$ ## $$\mathcal{L}$$ Stability of State Models Consider the system: \begin{equation}\begin{aligned}\dot{x} &=f(t, x, u), \quad x(0)=x_{0} \\y &=h(t, x, u)\end{aligned}\label{statemodel}\end{equation} where $$x\in \mathbb{R}^n$$, $$u\in \mathbb{R}^m$$, $$y\in \mathbb{R}^q$$, $$f:[0, \infty) \times D \times D_u \rightarrow \mathbb{R}^n$$ is piecewise cont. in $$t$$ and LL in $$(x,u)$$; $$h:[0, \infty) \times D \times D_u \rightarrow \mathbb{R}^q$$ is piecewise cont. in $$t$$ and cont. in $$(x,u)$$; $$D \subset \mathbb{R}^n$$ is a domain that contains $$x=0$$, and $$D_u \subset \mathbb{R}^m$$ is a domain that contains $$u = 0$$. Suppose $$x=0$$ is an equi. point of the unforced system $\begin{equation}\dot{x} = f(t,x,0)\label{unforced}\end{equation}$ Theorem 1: Consider the system $$\eqref{statemodel}$$ and take $$r > 0$$ and $$r_u > 0$$ s.t. $$\{ \|x\| \le r \} \subset D$$ and $$\{ \|u\| \le r_u \} \subset D_u$$. Suppose that • $$x=0$$ is an ES equi. point of $$\eqref{unforced}$$, and there is a $$V(t, x)$$ that satisfies $c_{1}\|x\|^{2} \leq V(t, x) \leq c_{2}\|x\|^{2} \\\frac{\partial V}{\partial t}+\frac{\partial V}{\partial x} f(t, x, 0) \leq -c_{3}\|x\|^{2} \\\left\|\frac{\partial V}{\partial x}\right\| \leq c_{4}\|x\|$ for all $$(t,x) \in [0, \infty) \times D$$ for some positive const. $$c_{1,2,3,4}$$ • $$f$$ and $$h$$ satisfy the inequalities \begin{align}\|f(t, x, u)-f(t, x, 0)\| \leq L\|u\| \\\|h(t, x, u)\| \leq \eta_{1}\|x\|+\eta_{2}\|u\| \label{hbound}\end{align} for all $$(t,x,u) \in [0, \infty) \times D \times D_u$$ for some nonnegative const. $$L, \eta_{1,2}$$ Then, for each $$x_0$$ with $$\|x_0\| \le r \sqrt{c_1 / c_2}$$, the system $$\eqref{statemodel}$$ is small-signal finite-gain $$\mathcal{L}_p$$ stable for each $$p \in [1, \infty]$$.
In particular, for each $$u \in \mathcal{L}_{pe}$$ with $$\sup_{0 \le t \le \tau} \| u(t) \| \le \min \{ r_u, c_1 c_3 r / (c_2 c_4 L) \}$$, the output $$y(t)$$ satisfies $\begin{equation}\left\|y_{\tau}\right\|_{\mathcal{L}_{p}} \leq \gamma \left\|u_{\tau}\right\|_{\mathcal{L}_{p}} + \beta\label{outputgain}\end{equation}$ for all $$\tau \in [0, \infty)$$, with $\gamma=\eta_{2}+\frac{\eta_{1} c_{2} c_{4} L}{c_{1} c_{3}}, \quad \beta=\eta_{1}\left\|x_{0}\right\| \sqrt{\frac{c_{2}}{c_{1}}} \rho, \text { where } \rho=\left\{\begin{array}{ll}{1,} & {\text { if } p=\infty} \\{\left(\frac{2 c_{2}}{c_{3} p}\right)^{1 / p},} & {\text { if } p \in[1, \infty)}\end{array}\right.$ Furthermore, if the origin is GES and all the assumptions hold globally (with $$D = \mathbb{R}^n$$ and $$D_u = \mathbb{R}^m$$), then, for each $$x_0 \in \mathbb{R}^n$$, the system $$\eqref{statemodel}$$ is finite-gain $$\mathcal{L}_p$$ stable for each $$p \in [1, \infty]$$. Corollary 1: Suppose that in some neighborhood of $$(x=0, u=0)$$, the function $$f(t,x,u)$$ is cont. diff., the Jacobian matrices $$\partial f / \partial x$$ and $$\partial f / \partial u$$ are bounded, uniformly in $$t$$, and $$h(t,x,u)$$ satisfies $$\eqref{hbound}$$. If the origin is an ES equi. point of $$\eqref{unforced}$$, then there is a const. $$r_0 > 0$$ s.t. for each $$x_0$$ with $$\|x_0\| < r_0$$, the system $$\eqref{statemodel}$$ is small-signal finite-gain $$\mathcal{L}_p$$ stable for each $$p \in [1, \infty]$$. Furthermore, if all the assumptions hold globally and the origin is a GES equi. point of $$\eqref{unforced}$$, then for each $$x_0 \in \mathbb{R}^n$$, the system $$\eqref{statemodel}$$ is finite-gain $$\mathcal{L}_p$$ stable for each $$p \in [1, \infty]$$ Corollary 2: The LTI system \begin{equation}\begin{aligned}\dot{x} &=A x+B u \\y &=C x+D u\end{aligned}\label{ltiinput}\end{equation} is finite-gain $$\mathcal{L}_p$$ stable for each $$p \in [1, \infty]$$ if $$A$$ is Hurwitz.
Moreover, $$\eqref{outputgain}$$ is satisfied with $\gamma=\|D\|_{2}+\frac{2 \lambda_{\max }^{2}(P)\|B\|_{2}\|C\|_{2}}{\lambda_{\min }(P)}, \quad \beta=\rho\|C\|_{2}\left\|x_{0}\right\| \sqrt{\frac{\lambda_{\max }(P)}{\lambda_{\min }(P)}}, \text { where } \rho=\left\{\begin{array}{ll}{1,} & {\text { if } p=\infty} \\{\left(\frac{2 \lambda_\max(P)}{p}\right)^{1 / p},} & {\text { if } p \in[1, \infty)}\end{array}\right.$ and $$P$$ is the solution of the Lyapunov equation $$PA + A^TP = - I$$ Theorem 2: Consider the system $$\eqref{statemodel}$$ and take $$r > 0$$ s.t. $$\{ \|x\| \le r \} \subset D$$. Suppose that • $$x=0$$ is an UAS equi. point of $$\eqref{unforced}$$, and there is a $$V(t, x)$$ that satisfies $\alpha_{1}(\|x\|) \leq V(t, x) \leq \alpha_{2}(\|x\|) \\\frac{\partial V}{\partial t}+\frac{\partial V}{\partial x} f(t, x, 0) \leq -\alpha_{3}(\|x\|) \\\left\|\frac{\partial V}{\partial x}\right\| \leq \alpha_4(\|x\|)$ for all $$(t,x) \in [0, \infty) \times D$$ for some $$\mathcal{K}$$ functions $$\alpha_{1,2,3,4}$$ • $$f$$ and $$h$$ satisfy the inequalities \begin{align}\|f(t, x, u)-f(t, x, 0)\| \leq \alpha_5(\|u\|) \\\|h(t, x, u)\| \leq \alpha_6(\|x\|) + \alpha_7(\|u\|) + \eta \label{hbound2}\end{align} for all $$(t,x,u) \in [0, \infty) \times D \times D_u$$ for some $$\mathcal{K}$$ functions $$\alpha_{5,6,7}$$, and a nonnegative const. $$\eta$$ Then, for each $$x_0$$ with $$\|x_0\| \le \alpha_2^{-1}(\alpha_1(r))$$, the system $$\eqref{statemodel}$$ is small-signal $$\mathcal{L}_\infty$$ stable. Corollary 3: Suppose that in some neighborhood of $$(x=0, u=0)$$, the function $$f(t,x,u)$$ is cont. diff., the Jacobian matrices $$\partial f / \partial x$$ and $$\partial f / \partial u$$ are bounded, uniformly in $$t$$, and $$h(t,x,u)$$ satisfies $$\eqref{hbound2}$$. If the origin is an UAS equi. point of $$\eqref{unforced}$$, then the system $$\eqref{statemodel}$$ is small-signal $$\mathcal{L}_\infty$$ stable.
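For the LTI case, the constants of Corollary 2 are directly computable: solve $$PA + A^T P = -I$$, form $$\gamma$$ and $$\beta$$ for $$p = \infty$$ (where $$\rho = 1$$), and verify the bound $$\sup_t \|y(t)\| \le \gamma \sup_t \|u(t)\| + \beta$$ by simulation. A numpy sketch with an arbitrary illustrative system:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])  # Hurwitz
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

# Solve P A + A^T P = -I by vectorization
n = A.shape[0]
M = np.kron(A.T, np.eye(n)) + np.kron(np.eye(n), A.T)
P = np.linalg.solve(M, -np.eye(n).flatten(order="F")).reshape((n, n), order="F")

eigs = np.linalg.eigvalsh(P)            # ascending order
lmin, lmax = eigs[0], eigs[-1]
gamma = np.linalg.norm(D, 2) + 2 * lmax**2 * np.linalg.norm(B, 2) * np.linalg.norm(C, 2) / lmin
x0 = np.array([1.0, 0.0])
beta = np.linalg.norm(C, 2) * np.linalg.norm(x0) * np.sqrt(lmax / lmin)  # rho = 1 for p = inf

# Simulate with u(t) = sin t (sup |u| = 1) using forward Euler
h, x, ymax = 1e-3, x0.copy(), 0.0
for i in range(int(20.0 / h)):
    t = i * h
    u = np.array([np.sin(t)])
    y = C @ x + D @ u
    ymax = max(ymax, np.linalg.norm(y))
    x = x + h * (A @ x + B @ u)

assert ymax <= gamma * 1.0 + beta  # L-infinity bound from Corollary 2
```

As expected, the quadratic-Lyapunov constants are conservative: the observed $$\sup_t \|y(t)\|$$ sits far below $$\gamma + \beta$$.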
Theorem 3: Consider the system $$\eqref{statemodel}$$ with $$D = \mathbb{R}^n$$ and $$D_u = \mathbb{R}^m$$. Suppose that • The system $$\dot{x} =f(t, x, u), \quad x(0)=x_{0}$$ is ISS • $$h$$ satisfies $$\eqref{hbound2}$$ Then, for each $$x_0 \in \mathbb{R}^n$$, the system $$\eqref{statemodel}$$ is $$\mathcal{L}_\infty$$ stable. ## $$\mathcal{L}_2$$ Gain Theorem 4: Consider the system $$\eqref{ltiinput}$$ where $$A$$ is Hurwitz. Let $$G(s) = C (sI - A)^{-1} B + D$$. Then, the $$\mathcal{L}_2$$ gain of the system is $$\sup_{\omega \in \mathbb{R}} \| G(j \omega) \|_2$$. Theorem 5: Consider the time-invariant nonlinear system \begin{equation}\begin{aligned}\dot{x} &=f(x)+G(x) u, \quad x(0)=x_{0} \\y &=h(x)\end{aligned}\label{tinonlinear}\end{equation} where $$f(x)$$ is locally Lipschitz, and $$G(x), h(x)$$ are cont. over $$\mathbb{R}^n$$. The matrix $$G \in \mathbb{R}^{n \times m}$$ and $$h : \mathbb{R}^n \rightarrow \mathbb{R}^q$$. $$f(0)=0, h(0)=0$$. Let $$\gamma$$ be a positive number and suppose there is a cont. diff. PSD function $$V(x)$$ that satisfies the Hamilton-Jacobi inequality $\mathcal{H}(V, f, G, h, \gamma) \stackrel{\text { def }}{=} \frac{\partial V}{\partial x} f(x)+\frac{1}{2 \gamma^{2}} \frac{\partial V}{\partial x} G(x) G^{T}(x)\left(\frac{\partial V}{\partial x}\right)^{T}+\frac{1}{2} h^{T}(x) h(x) \leq 0$ for all $$x \in \mathbb{R}^n$$. Then, for each $$x_0 \in \mathbb{R}^n$$, the system $$\eqref{tinonlinear}$$ is finite-gain $$\mathcal{L}_2$$ stable and its $$\mathcal{L}_2$$ gain is less than or equal to $$\gamma$$. Corollary 4: Suppose the assumptions of Theorem 5 are satisfied on a domain $$D \subset \mathbb{R}^n$$ that contains the origin. 
Then, for any $$x_0 \in D$$ and any $$u \in \mathcal{L}_{2e}$$ for which the solution $$x$$ of $$\eqref{tinonlinear}$$ satisfies $$x(t) \in D$$ for all $$t \in [0, \tau]$$, we have $\left\|y_{\tau}\right\|_{\mathcal{L}_{2}} \leq \gamma\left\|u_{\tau}\right\|_{\mathcal{L}_{2}}+\sqrt{2 V\left(x_{0}\right)}$ Lemma 1: Suppose the assumptions of Theorem 5 are satisfied on a domain $$D \subset \mathbb{R}^n$$ that contains the origin, $$f(x)$$ is a cont. diff. function, and $$x=0$$ is an AS equi. point of $$\dot{x} = f(x)$$. Then, there is $$k_1 > 0$$ s.t. for each $$x_0$$ with $$\| x_0 \| \le k_1$$, the system $$\eqref{tinonlinear}$$ is small-signal finite-gain $$\mathcal{L}_2$$ stable with $$\mathcal{L}_2$$ gain less than or equal to $$\gamma$$. Lemma 2: Suppose the assumptions of Theorem 5 are satisfied on a domain $$D \subset \mathbb{R}^n$$ that contains the origin, $$f(x)$$ is a cont. diff. function, and no solution of $$\dot{x} = f(x)$$ can stay identically in $$S = \{ x\in D | h(x) =0 \}$$ other than $$x(t) \equiv 0$$. Then, the origin of $$\dot{x} = f(x)$$ is AS and there is $$k_1 > 0$$ s.t. for each $$x_0$$ with $$\| x_0 \| \le k_1$$, the system $$\eqref{tinonlinear}$$ is small-signal finite-gain $$\mathcal{L}_2$$ stable with $$\mathcal{L}_2$$ gain less than or equal to $$\gamma$$. ## Feedback Systems Consider two systems $$H_1 : \mathcal{L}_e^m \rightarrow \mathcal{L}_e^q$$ and $$H_2 : \mathcal{L}_e^q \rightarrow \mathcal{L}_e^m$$. 
Suppose both systems are finite-gain $$\mathcal{L}$$ stable, that is ${\left\|y_{1 \tau}\right\|_{\mathcal{L}} \leq} {\gamma_{1}\left\|e_{1 \tau}\right\|_{\mathcal{L}}+\beta_{1}, \quad \forall e_{1} \in \mathcal{L}_{e}^{m}, \forall \tau \in[0, \infty)} \\{\left\|y_{2 \tau}\right\|_{\mathcal{L}}} {\leq \gamma_{2}\left\|e_{2 \tau}\right\|_{\mathcal{L}}+\beta_{2}, \quad \forall e_{2} \in \mathcal{L}_{e}^{q}, \forall \tau \in[0, \infty)}$ Suppose further that the feedback system is well defined: for every pair of inputs $$u_1 \in \mathcal{L}_e^m$$ and $$u_2 \in \mathcal{L}_e^q$$, there exist unique outputs $$e_1, y_2 \in \mathcal{L}_e^m$$ and $$e_2, y_1 \in \mathcal{L}_e^q$$. Define $u=\begin{bmatrix}{u_{1}} \\ {u_{2}}\end{bmatrix}, \quad y=\begin{bmatrix}{y_{1}} \\ {y_{2}}\end{bmatrix}, \quad e=\begin{bmatrix} {e_{1}} \\ {e_{2}}\end{bmatrix}$ The question is whether the feedback connection, when viewed as a mapping from $$u$$ to $$e$$ or a mapping from $$u$$ to $$y$$, is finite-gain $$\mathcal{L}$$ stable. The two statements are equivalent. Theorem 6: The feedback connection is finite-gain $$\mathcal{L}$$ stable if $$\gamma_1 \gamma_2 < 1$$. ME 583 Review. Based on the book <em>Nonlinear Systems</em> by Hassan K. Khalil. Matrix Derivatives https://silencial.github.io/matrix-derivative/ 2019-08-01 Notes on doing derivatives w.r.t. matrix/vector # Definition • $$f$$: real-valued function. • Bold lowercase letter ($$\mathbf{x}, \mathbf{y}, \mathbf{z}$$): Vector. • Uppercase letter ($$X, Y, Z$$): Matrix # Properties • $$\nabla_x f = (\nabla_{x^T} f)^T$$. • $$\delta f \approx \sum_{i, j}\left(\nabla_{X} f\right)_{i j}(\delta X)_{i j}=\operatorname{tr}\left((\nabla f)^{T} \delta X\right)$$ • If $$y = f(\mathbf{u}), \mathbf{u}=\mathbf{g}(\mathbf{x})$$, then $$\displaystyle \frac{\partial f}{\partial \mathbf{x}}=\left(\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\right)^{T} \frac{\partial f}{\partial \mathbf{u}}$$. 
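These gradient conventions can be spot-checked with central finite differences. A minimal sketch for $$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$$, whose gradient is $$(A + A^T)\mathbf{x}$$ (the random data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

f = lambda v: v @ A @ v            # scalar function f(x) = x^T A x

# Central finite differences along each coordinate direction
eps = 1e-6
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

analytic = (A + A.T) @ x           # closed-form gradient
assert np.allclose(num_grad, analytic, atol=1e-5)
```

The same check works for any of the identities listed below by swapping in the corresponding `f` and closed form.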
## Vector • $$\nabla A \mathbf{x}=A$$ • $$\nabla\left(\mathbf{a}^{\mathrm{T}} \mathbf{x}\right)=\mathbf{a}$$ • $$\nabla\|\mathbf{x}\|_{2}^{2}=\nabla\left(\mathbf{x}^{\mathbf{T}} \mathbf{x}\right)=2 \mathbf{x}$$ • $$\nabla\left(\mathbf{x}^{T} A \mathbf{x}\right)=\left(A+A^{T}\right) \mathbf{x}$$ • $$\nabla\left(\mathbf{u}^{\mathrm{T}} \mathbf{v}\right)=\left(\nabla_{\mathbf{x}} \mathbf{u}\right)^{T} \mathbf{v}+\left(\nabla_{\mathbf{x}} \mathbf{v}\right)^{T} \mathbf{u}$$ • $$\nabla_{\mathbf{x}}(\alpha(\mathbf{x}) \mathbf{f}(\mathbf{x}))=\mathbf{f}(\mathbf{x}) \nabla_{\mathbf{x}^{\mathrm{T}}} \alpha(\mathbf{x})+\alpha(\mathbf{x}) \nabla_{\mathbf{x}} \mathbf{f}(\mathbf{x})$$ ## Trace Cyclic property: $$\operatorname{tr}\left(A_{1} A_{2} \cdots A_{n}\right)=\operatorname{tr}\left(A_{2} A_{3} \cdots A_{n} A_{1}\right)=\cdots=\operatorname{tr}\left(A_{n-1} A_{n} A_{1} \cdots A_{n-2}\right)=\operatorname{tr}\left(A_{n} A_{1} \cdots A_{n-2} A_{n-1}\right)$$ Frobenius Norm: $$\|A\|_F^2 = \operatorname{tr}(A^T A)$$ • $$\nabla \operatorname{tr}\left(A^{T} X\right)=\nabla \operatorname{tr}\left(A X^{T}\right)=A$$, $$\nabla \operatorname{tr}(A X)=\nabla \operatorname{tr}(X A)=A^{T}$$ • $$\nabla \operatorname{tr}\left(X A X^{T} B\right)=B^{T} X A^{T}+B X A$$ • $$\nabla \mathbf{a}^{T} X \mathbf{b}=\mathbf{a} \mathbf{b}^{T}$$ • $$\nabla \mathbf{a}^{T} X^{T} X \mathbf{a}=2 X \mathbf{a} \mathbf{a}^{T}$$ • $$\nabla_{X}|X|=|X|\left(X^{-1}\right)^{T}$$ ## Matrix • If $$y = f(U), U = G(X)$$, then $$\displaystyle \frac{\partial y}{\partial x_{i j}}=\operatorname{tr}\left(\left(\frac{\partial y}{\partial U}\right)^{T} \frac{\partial U}{\partial x_{i j}}\right)$$ • If $$f(Y): \mathbb{R}^{m\times p} \rightarrow \mathbb{R}$$ and $$Y = AX + B$$, then $$\nabla_{X} f(A X+B)=A^{T} \nabla_{Y} f$$ • $$\nabla_{\mathbf{x}}^{2} f(\mathbf{A} \mathbf{x}+\mathbf{b})=A^{T}\left(\nabla_{\mathbf{y}}^{2} f\right) A$$ • $$\nabla_{X} f(X C+D)=\left(\nabla_{Y} f\right) C^{T}$$ # Computation Graph For batch 
normalization: \begin{aligned}\mu_B &\leftarrow \frac{1}{m} \sum_{i=1}^m x_i \\\sigma_B^2 &\leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 \\\hat{x}_i &\leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\y_i &\leftarrow \gamma \hat{x}_i + \beta\end{aligned} Draw the graph • For $$\hat{x}_i$$ there is only one path $$\hat{x}_i \rightarrow y_i \rightarrow l$$. So that $$\displaystyle \frac{\partial l}{\partial \hat{x}_i} = \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial \hat{x}_i} = \frac{\partial l}{\partial y_i} \gamma$$ • For $$\gamma$$ there are $$m$$ paths $$\forall i, \gamma \rightarrow y_i \rightarrow l$$. So that $$\displaystyle \frac{\partial l}{\partial \gamma} = \sum_i \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial \gamma} = \sum_i \frac{\partial l}{\partial y_i} \hat{x}_i$$ • For $$\beta$$, similar to $$\gamma$$. $$\displaystyle \frac{\partial l}{\partial \beta} = \sum_i \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial \beta} = \sum_i \frac{\partial l}{\partial y_i}$$ • For $$\sigma_B^2$$, there are $$m$$ paths. So that $$\displaystyle \frac{\partial l}{\partial \sigma_B^2} = \sum_i \frac{\partial l}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial \sigma_B^2} = \sum_i \frac{\partial l}{\partial \hat{x}_i} \cdot -\frac{1}{2} (x_i - \mu_B) (\sigma_B^2 + \epsilon)^{-3/2}$$ • For $$\mu_B$$, there are $$2m$$ paths $$\forall i, \mu_B \rightarrow \hat{x}_i \rightarrow l, \mu_B \rightarrow \sigma_B^2 \rightarrow l$$. 
So that $$\displaystyle \frac{\partial l}{\partial \mu_B} = \sum_i \frac{\partial l}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial \mu_B} + \frac{\partial l}{\partial \sigma_B^2} \frac{\partial \sigma_B^2}{\partial \mu_B} = \sum_i \frac{\partial l}{\partial \hat{x}_i} \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} - \frac{\partial l}{\partial \sigma_B^2} \frac{2}{m} \sum_j (x_j - \mu_B) = \sum_i \frac{\partial l}{\partial \hat{x}_i} \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}}$$, where the second term vanishes since $$\sum_j (x_j - \mu_B) = 0$$. • For $$x_i$$, there are $$3$$ paths $$x_i \rightarrow \hat{x}_i \rightarrow l, x_i \rightarrow \sigma_B^2 \rightarrow l, x_i \rightarrow \mu_B \rightarrow l$$. So that $$\displaystyle \frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial x_i} + \frac{\partial l}{\partial \sigma_B^2} \frac{\partial \sigma_B^2}{\partial x_i} + \frac{\partial l}{\partial \mu_B} \frac{\partial \mu_B}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \frac{2}{m} (x_i - \mu_B) + \frac{\partial l}{\partial \mu_B} \frac{1}{m}$$ # Example ## Least Square Expand: \begin{aligned}\nabla_{\mathbf{x}}\|A \mathbf{x}-\mathbf{b}\|_{2}^{2} &= \nabla_{\mathbf{x}}(A \mathbf{x}-\mathbf{b})^{T}(A \mathbf{x}-\mathbf{b}) \\&=\nabla_{\mathbf{x}}\left(\mathbf{x}^{T} A^{T} A \mathbf{x}\right)-2 \nabla_{\mathbf{x}}\left(\mathbf{b}^{T} A \mathbf{x}\right) \\&= 2 A^{T} A \mathbf{x}-2 A^{T} \mathbf{b} \\&= 2 A^{T}(A \mathbf{x}-\mathbf{b})\end{aligned} Use linear transformation form: \begin{aligned}\nabla_{\mathbf{x}} \|A \mathbf{x}-\mathbf{b} \|_{2}^{2} &= A^{T} \nabla_{A \mathbf{x}-\mathbf{b}}\|A \mathbf{x}-\mathbf{b}\|_{2}^{2} \\&= A^{T}(2(A \mathbf{x}-\mathbf{b})) \\&= 2 A^{T}(A \mathbf{x}-\mathbf{b})\end{aligned} ## Frobenius Norm Use trace: \begin{aligned}\nabla\left\|X A^{T}-B\right\|_{F}^{2} &= \nabla \operatorname{tr}\left(\left(X A^{T}-B\right)^{T}\left(X A^{T}-B\right)\right) \\&= \nabla \left(\operatorname{tr}\left(A X^{T} X A^{T}\right)-2 \operatorname{tr}\left(A X^{T} B\right)+\operatorname{tr}\left(B^{T} B\right) \right) \\&= 2 XA^TA - 2BA \\&= 2(XA^T - B)A\end{aligned} Use linear transformation form: \begin{aligned}\nabla\left\|X A^{T}-B\right\|_{F}^{2} &= \nabla\left\|A X^{T}-B^{T}\right\|_{F}^{2} \\&= \left(\nabla_{X^{T}}\left\|A X^{T}-B^{T}\right\|_{F}^{2}\right)^{T} \\&= \left(A^{T}\left(2\left(A 
X^{T}-B^{T}\right)\right)\right)^{T} \\&= 2\left(X A^{T}-B\right) A\end{aligned} ## PRML Calculate the gradient: $f(W)=\ln p(T | X, W, \beta)=\mathrm{const}-\frac{\beta}{2} \sum_{n}\left\|\mathbf{t}_{n}-W^{T} \phi\left(\mathbf{x}_{n}\right)\right\|_{2}^{2}$ Use F-norm: ==The sum of 2-norm square of vectors equals the F-norm square of a big matrix== \begin{aligned}\nabla f &= -\nabla\left( \frac{\beta}{2} \sum_{n}\left\|\mathbf{t}_{n}-W^{T} \phi\left(\mathbf{x}_{n}\right)\right\|_{2}^{2} \right) \\&= -\frac{\beta}{2} \nabla \|T^T - W^T \Phi^T\|_F^2 \\&= -\frac{\beta}{2} \nabla \|\Phi W - T\|_F^2 \\&= -\frac{\beta}{2} \Phi^T(2 (\Phi W - T)) \\&= -\beta \Phi^T (\Phi W - T)\end{aligned} Use inner product: This method is cumbersome but more general. \begin{aligned}\nabla f &= -\nabla\left( \frac{\beta}{2} \sum_{n}\left\|\mathbf{t}_{n}-W^{T} \phi\left(\mathbf{x}_{n}\right)\right\|_{2}^{2} \right) \\&= -\frac{\beta}{2} \nabla \left( \sum_n(\mathbf{t}_n - W^T \phi(\mathbf{x}_n))^T (\mathbf{t}_n - W^T \phi(\mathbf{x}_n)) \right) \\&= -\frac{\beta}{2} \sum_n \left( -2\phi(\mathbf{x}_n) \mathbf{t}_n^T + 2\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T W \right) \\&= -\beta \sum_n \phi(\mathbf{x}_n) \left( -\mathbf{t}_n^T + \phi(\mathbf{x}_n)^T W \right) \\&= -\beta \Phi^T(\Phi W - T)\end{aligned} where $$\Phi^T = (\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n))$$ ## RNN Given the state equation, calculate the derivative of the loss function $$l$$ w.r.t. $$W$$. $\mathbf{h}_t = W f(\mathbf{h}_{t-1}) + U \mathbf{x}_t + \mathbf{b}$ Since $$l = \sum_t l_t$$, we only calculate the derivative of $$l_t$$ w.r.t. $$W$$. 
\begin{aligned}\frac{\partial l_t}{\partial W} &= \sum_{k=1}^t \frac{\partial l_t}{\partial W^{(k)}} \\&= \sum_{k=1}^t \frac{\partial l_t}{\partial \mathbf{h}_k} (f(\mathbf{h}_{k-1}))^T\end{aligned} where $$W^{(k)}$$ denotes the copy of $$W$$ used at time step $$k$$, and \begin{aligned}\frac{\partial l_t}{\partial \mathbf{h}_k^T} &= \frac{\partial l_t}{\partial \mathbf{h}_t^T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}^T}\dots \frac{\partial \mathbf{h}_{k+1}}{\partial \mathbf{h}_{k}^T} \\&= \frac{\partial l_t}{\partial \mathbf{h}_t^T} W \operatorname{diag}(f'(\mathbf{h}_{t-1}))\dots W \operatorname{diag}(f'(\mathbf{h}_{k}))\end{aligned} Plug it in to get the final equation. ## Autoencoder With Tied-Weight Calculate the gradient: $$\sigma(\cdot)$$ is the element-wise sigmoid function. $f(W) = l(\mathbf{b}_2 + W^T \sigma(W\mathbf{x} + \mathbf{b}_1))$ Treat it like $$\nabla_W f = \nabla_W l(\mathbf{b}_2 + W^T \sigma(W_c\mathbf{x} + \mathbf{b}_1)) + \nabla_W l(\mathbf{b}_2 + W_c^T \sigma(W\mathbf{x} + \mathbf{b}_1))$$, where $$W_c$$ denotes a copy of $$W$$ held constant during differentiation. The first term: \begin{aligned}\nabla_W l(\mathbf{b}_2 + W^T \sigma(W_c\mathbf{x} + \mathbf{b}_1)) & = \left( \nabla_{W^T} l(\mathbf{b}_2 + W^T \sigma(W_c\mathbf{x} + \mathbf{b}_1)) \right)^T \\&= \left( \nabla_{\mathbf{z} = \mathbf{b}_2 + W^T \sigma(W_c\mathbf{x} + \mathbf{b}_1)}l(\mathbf{z}) \left(\nabla_{W^T}\mathbf{z}\right) \right)^T \\&= \left( \nabla_{\mathbf{z}}l(\mathbf{z}) (\sigma(W_c\mathbf{x} + \mathbf{b}_1))^T \right)^T \\&= \sigma(W_c\mathbf{x} + \mathbf{b}_1) (\nabla_{\mathbf{z}}l(\mathbf{z}))^T\end{aligned} The second term: \begin{aligned}\nabla_W l(\mathbf{b}_2 + W_c^T \sigma(W\mathbf{x} + \mathbf{b}_1)) &= \nabla_{\mathbf{u} = W\mathbf{x} + \mathbf{b}_1} l(\mathbf{b}_2 + W_c^T \sigma(\mathbf{u})) \mathbf{x}^T \\&= \nabla_{\mathbf{u}^T} l(\mathbf{b}_2 + W_c^T \sigma(\mathbf{u}))^T \mathbf{x}^T \\&= \left( \nabla_{\mathbf{t}^T} l(\mathbf{t})\frac{\partial \mathbf{t}}{\partial \mathbf{v}} \frac{\partial 
\mathbf{v}}{\partial \mathbf{u}} \right)^T \mathbf{x}^T \\&= \left( \nabla_{\mathbf{t}^T} l(\mathbf{t}) W_c^T \operatorname{diag}(\sigma'(\mathbf{u})) \right)^T \mathbf{x}^T \\&= \operatorname{diag}(\sigma'(\mathbf{u})) W_c \nabla_\mathbf{t} l(\mathbf{t}) \mathbf{x}^T\end{aligned} where $$\mathbf{v} = \sigma(\mathbf{u}), \mathbf{t} = \mathbf{b}_2 + W_c^T \mathbf{v}$$. Machine Learning https://silencial.github.io/machine-learning/ 2019-07-26 Machine Learning notes. Learning = Representation + Evaluation + Optimization # Linear Regression ## Cost Function Hypothesis: $h_{\theta}(x)=\theta_{0}+\theta_{1} x$ Parameters: $\theta_0,\quad \theta_1$ Cost function: $J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$ Goal: $\underset{\theta_{0}, \theta_{1}}{\operatorname{minimize}} J\left(\theta_{0}, \theta_{1}\right)$ ## Multiple Variables $h_{\theta}(x)=\theta^{T} x=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\cdots+\theta_{n} x_{n}$ Make sure to scale every feature into approximately the $$-1 \le x_i \le 1$$ range. ==(Mean normalization)== Be careful of the model interpretation when multicollinearity (multiple variables correlated with each other) is present. ## Gradient Descent Repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ where $$\alpha$$ is the learning rate. • If $$\alpha$$ is too small, gradient descent can be slow. • If $$\alpha$$ is too large, gradient descent can overshoot. It may fail to converge, or even diverge. Gradient descent can get stuck in a local minimum if the cost function is not convex. For linear/logistic regression the update formula is the same: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$ ## Normal Equation Solve for $$\theta$$ analytically. 
\begin{aligned}&\frac{\partial}{\partial \theta_j} J(\theta) = 0 \\\Longrightarrow\ & \theta = (X^T X)^{-1} X^T y\end{aligned} # Regularization Overfitting: If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples. ## L1 Regularization $\|\theta\|_1 = \sum_i |\theta_i|$ In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. ## L2 Regularization $\|\theta\|_2^2 = \sum_i \theta_i^2$ L2 regularization helps drive outlier weights closer to 0 but not quite to 0. ### Gradient Descent Modify the cost function to be: $J(\theta)=\frac{1}{2 m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n} \theta_{j}^{2}\right]$ then \begin{aligned}\theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)}) x_0^{(i)} \\\theta_j &:= \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)}) x_j^{(i)}\end{aligned} ### Normal Equation $\theta =\left(X^T X + \lambda\left[\begin{smallmatrix}0 & & & \\& 1 & & \\& & \ddots & \\& & & 1\end{smallmatrix}\right]\right)^{-1}X^T y$ # Logistic Regression ## Classification With $$y \in \{0, 1\}$$. Use $$h_\theta(x)$$ as the estimated probability that $$y=1$$ on input $$x$$. We want $$0 \le h_\theta(x)\le 1$$. \begin{aligned}h_\theta(x) &= g(\theta^T x) \\&= \frac{1}{1 + e^{-\theta^T x}}\end{aligned} where $$g$$ is called the Sigmoid function. 
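The logistic hypothesis above can be sketched directly (the parameter and input values are illustrative; $$x_0 = 1$$ is the intercept term):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Estimated probability that y = 1 given input x."""
    return sigmoid(theta @ x)

theta = np.array([0.0, 1.0, -2.0])   # illustrative parameters
x = np.array([1.0, 3.0, 1.0])        # x_0 = 1 is the intercept term
p = h(theta, x)                      # sigmoid(1) ~= 0.731
print(p)
```

Since the output is a probability in $$(0, 1)$$, classification thresholds it, typically predicting $$y = 1$$ when $$h_\theta(x) \ge 0.5$$.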
Cost function: \begin{aligned}\text{Cost}(h_\theta(x), y) =\begin{cases}-\log(h_\theta(x)) \quad &\text{if } y=1\\-\log(1 - h_\theta(x)) \quad &\text{if } y=0\end{cases}\end{aligned} And \begin{aligned}J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} \operatorname{Cost}(h_{\theta}(x^{(i)}), y^{(i)}) \\&= -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)})+(1-y^{(i)}) \log (1-h_{\theta}(x^{(i)}))\right]\end{aligned} ## Multi-Class Classification Train a logistic regression classifier $$h_\theta^{(i)}(x)$$ for each class $$i$$ to predict the probability that $$y = i$$. On a new input $$x$$, pick the class $$i$$ that maximizes $$h_\theta^{(i)}(x)$$. # Neural Network 1. Initialize weights. 2. Choose the activation function. 3. Implement forward propagation to get $$h_\Theta(x^{(i)})$$ for any $$x^{(i)}$$. 4. Implement code to compute the cost function $$J(\Theta)$$. 5. Implement backprop to compute partial derivatives. 6. Use gradient checking to verify that backprop is correct. Then disable the checking code. 7. Use gradient descent or another optimization method with backpropagation to minimize $$J(\Theta)$$. ## Weight Initialization Assigning the weights from a Gaussian distribution with zero mean and small variance is okay for small networks, but causes problems in deeper networks. Since the goal is to keep the variance the same through each layer, we can use Xavier initialization. $\mathop{var}(w_i) = \frac{1}{N_{avg}} = \frac{2}{N_{in} + N_{out}}$ ==Xavier assumes a zero-centered activation function; double the variance if using ReLU== ## Activation Function ### Hidden Layer ==ReLU is a good default choice for most problems== Consider when choosing an activation function: • Vanishing/Exploding gradients: When the local gradient is very small/large, it will kill/blow up the gradient during backprop. • Zero-centered: Non-zero-centered outputs can introduce undesirable zig-zagging dynamics in gradient updates. ### Output Layer • Use SoftMax function for classification problems. 
• Use Linear function for regression problems. ## Cost Function $$h_\Theta (x) \in \mathbb{R}^K, \quad (h_\Theta(x))_i = i^{th}$$ output. \begin{aligned}J(\Theta)=&-\frac{1}{m}\left[\sum_{i=1}^{m} \sum_{k=1}^{K} y_{k}^{(i)} \log \left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}+\left(1-y_{k}^{(i)}\right) \log \left(1-\left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}\right)\right] \\&+\frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l}} \sum_{j=1}^{s_{l+1}}\left(\Theta_{j i}^{(l)}\right)^{2}\end{aligned} ## Backpropagation Given one training example $$(x, y)$$, use forward propagation to compute the cost: \begin{aligned}a^{(1)} &= x \\z^{(2)} &= \Theta^{(1)} a^{(1)} \\a^{(2)} &= g(z^{(2)}) \quad (\text{add } a_0^{(2)}) \\z^{(3)} &= \Theta^{(2)} a^{(2)} \\a^{(3)} &= g(z^{(3)}) \quad (\text{add } a_0^{(3)}) \\z^{(4)} &= \Theta^{(3)} a^{(3)} \\a^{(4)} &= h_\Theta(x) = g(z^{(4)})\end{aligned} In backpropagation, first compute $$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} J(\Theta)$$, which is the "error" of node $$j$$ in layer $$l$$. \begin{aligned}\delta^{(4)} &= a^{(4)} - y \\\delta^{(3)} &= (\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)}) \\\delta^{(2)} &= (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)}) \\\end{aligned} Then the gradient: \begin{aligned}\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) &= \frac{1}{m} a_j^{(l)} \delta_i^{(l + 1)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} &\text{if } j \ne 0 \\\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) &= \frac{1}{m} a_j^{(l)} \delta_i^{(l + 1)} &\text{if } j = 0 \\\end{aligned} ## Gradient Check A numerical method to check that the backprop implementation is correct. Useful in implementation. $\frac{\partial}{\partial \theta_j} J(\theta) \approx \frac{J(\theta_1,\dots, \theta_j + \epsilon, \dots, \theta_n) - J(\theta_1,\dots, \theta_j - \epsilon, \dots, \theta_n)}{2\epsilon}$ ## Batch Normalization Apply feature scaling not just to the input layer, but also to the hidden units. 
Batch normalization makes it possible to use significantly higher learning rates, and reduces the sensitivity to initialization. Before the activation function, normalize $$z$$ to $$z_{norm}$$ and then scale and shift to $$\tilde{z}$$: $\tilde{z} = \gamma z_{norm} + \beta$ where $$\beta$$ and $$\gamma$$ are learned during training. ## Dropout Regularization Besides using early stopping and L1/L2 regularization, dropout regularization is another popular approach to prevent overfitting in neural networks. • Each time an example is read, some hidden units are removed with some probability. • During testing no dropout is applied; instead activations are scaled to approximate averaging over the thinned networks. • Different hidden layers can have different dropout probabilities. • It can be viewed as a form of model averaging. # Convolutional Neural Network Useful in computer vision: • Classification • Segmentation • Localization • Detection Since using a regular neural network on large images requires a huge number of parameters, CNNs use convolution + pooling layers to first perform feature extraction before passing the result to fully connected hidden layers. Main advantages of CNNs over just fully connected layers: • Parameter sharing: a feature detector is likely invariant to locations. • Sparsity of connections: each neuron is connected to only a local region of the input. • Translation invariance. ## Convolution A convolution layer is made up of some predetermined number of filters. Each filter acts as a detector for a particular feature. Always use multiple filters at the same time and stack all the resulting feature maps together. Treat the convolution matrices as parameters and learn them through backprop. Use an activation function to introduce nonlinearity to the output, e.g. ReLU. ==Since all neurons in a feature map share the same parameters, a CNN can recognize a pattern in any location once it is learned.== ## Pooling Pooling layers are in charge of downsampling the input. They decrease the feature map size while keeping the important information. 
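The sliding-filter computation described in the Convolution subsection can be sketched naively (as in most deep-learning libraries, this is technically cross-correlation; shapes and data are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a single filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Filter response at location (i, j): elementwise product + sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])    # horizontal difference filter (illustrative)
fmap = conv2d(image, edge)        # feature map, shape (4, 3)
print(fmap.shape)                 # (4, 3)
```

Note the output shrinks from 4×4 to 4×3; this is the shrinkage that zero-padding, discussed below, is meant to counteract.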
The most common type of pooling is max pooling. • Given pooling window size, stride size and padding size. • Take the max value in the pooling window. Pooling also helps to make the representation approximately invariant to small translations. This property is useful if we care more about whether some feature is present than exactly where it is. Padding (zero-padding) is used in CNNs to preserve the size of the feature maps, which would otherwise shrink at each layer. ## Transfer Learning Transfer learning is common in CNNs. When the dataset you are interested in is small, train the CNN on a large dataset with similar data, then transfer to your dataset. Usually only the FC layers need to be relearned during transfer. The number of layers that need to be relearned depends on the size of your dataset. # Recurrent Neural Network More flexible in architecture design: • Image Captioning • Sentiment Classification • Machine Translation • Video Classification on frame level ## Vanilla RNN We can process a sequence of vectors $$x$$ by applying a recurrence formula at every time step: \begin{aligned}h_t &= f_W (h_{t-1}, x_t) \\&= \tanh(W_{hh} h_{t-1} + W_{xh} x_t) \\\\y_t &=W_{hy} h_t\end{aligned} ==Computing the gradient of $$h_0$$ involves many factors of $$W$$ and $$\tanh$$, easily leading to exploding/vanishing gradients.== ## LSTM Long Short Term Memory (LSTM) \begin{aligned}\begin{pmatrix}{i} \\ {f} \\ {o} \\ {g}\end{pmatrix}&=\begin{pmatrix}{\sigma} \\ {\sigma} \\ {\sigma} \\ {\tanh }\end{pmatrix} W\begin{pmatrix}{h_{t-1}} \\ {x_{t}}\end{pmatrix} \\c_{t} &= f \odot c_{t-1}+i \odot g \\h_{t} &= o \odot \tanh \left(c_{t}\right)\end{aligned} • $$f$$: Forget gate, whether to erase cell • $$i$$: Input gate, whether to write to cell • $$g$$: Gate gate, how much to write to cell • $$o$$: Output gate, how much to reveal cell # Generative Adversarial Network • Generator network: try to fool the discriminator by generating real-looking images. 
• Discriminator network: try to distinguish between real and fake images. Train jointly with the minimax objective function: $\min _{\theta_{g}} \max _{\theta_{d}}\left[\mathbb{E}_{x \sim p_{\text{data}}} \log D_{\theta_{d}}(x)+\mathbb{E}_{z \sim p(z)} \log \left(1-D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)\right]$ In practice, alternate between: 1. Gradient ascent on discriminator $\max _{\theta_{d}}\left[\mathbb{E}_{x \sim p_{\text{data}}} \log D_{\theta_{d}}(x)+\mathbb{E}_{z \sim p(z)} \log \left(1-D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)\right]$ 2. Gradient ascent on generator $\max _{\theta_{g}} \mathbb{E}_{z \sim p(z)} \log \left(D_{\theta_{d}}\left(G_{\theta_{g}}(z)\right)\right)$ ==The reason to use gradient ascent instead of gradient descent for the generator is to put more gradient signal on the region where samples are bad.== # Reinforcement Learning Problems involving an agent interacting with an environment, which provides numeric reward signals. The goal is to learn how to take actions in order to maximize reward. ## Markov Decision Process • $$\mathcal{S}$$: set of possible states • $$\mathcal{A}$$: set of possible actions • $$\mathcal{R}$$: distribution of reward given state and action • $$\mathbb{P}$$: transition probability • $$\gamma$$: discount factor 1. At time step $$t=0$$, environment samples initial state $$s_0 \sim p(s_0)$$ 2. For $$t=0$$ until done: 1. Agent selects action $$a_t$$ 2. Environment samples reward $$r_t \sim R(\cdot \mid s_t, a_t)$$ 3. Environment samples next state $$s_{t+1} \sim P(\cdot \mid s_t, a_t)$$ 4. Agent receives reward $$r_t$$ and next state $$s_{t+1}$$ A policy $$\pi: \mathcal{S} \to \mathcal{A}$$ specifies what action to take in each state. Objective: find policy $$\pi^* = \arg\max_\pi \mathbb{E}[\sum_t \gamma^t r_t]$$ ## Basics ### Value Function Measures how good a state is. 
The value function at state $$s$$ is the expected cumulative reward from following the policy from state $$s$$: $V^{\pi}(s)=\mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | s_{0}=s, \pi\right]$ ### Q-Value Function Measures how good a state-action pair is. The Q-value function at state $$s$$ and action $$a$$ is the cumulative reward from taking action $$a$$ in state $$s$$ and then following the policy: $Q^{\pi}(s, a)=\mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | s_{0}=s, a_{0}=a, \pi\right]$ ### Bellman Equation The optimal Q-value function $$Q^*$$ is the maximum expected cumulative reward achievable from a given state-action pair: $Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | s_{0}=s, a_{0}=a, \pi\right]$ $$Q^*$$ satisfies the Bellman equation: $Q^{*}(s, a)=\mathbb{E}_{s^{\prime} \sim \mathcal{E}}\left[r+\gamma \max _{a^{\prime}} Q^{*}\left(s^{\prime}, a^{\prime}\right) | s, a\right]$ ## Solver ### Value Iteration Use the Bellman equation as an iterative update: $Q_{i+1}(s, a)=\mathbb{E}\left[r+\gamma \max _{a^{\prime}} Q_i\left(s^{\prime}, a^{\prime}\right) | s, a\right]$ $$Q_i$$ will converge to $$Q^*$$ as $$i\to \infty$$ ==Problem: Must compute $$Q(s,a)$$ for every state-action pair.== ### Q-Learning Use a function approximator to estimate $$Q(s,a)$$, e.g. a neural network: $$Q(s,a; \theta) \approx Q^*(s,a)$$. The loss function is $L_{i}\left(\theta_{i}\right)=\mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_{i}-Q\left(s, a ; \theta_{i}\right)\right)^{2}\right]$ where $$y_i = \mathbb{E}_{s'\sim \mathcal{E}}\left[r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right) | s, a\right]$$ ### Policy Gradients Sometimes the $$Q$$-function can be very complicated while the policy is much simpler. 
Gradient ascent on policy parameters with rewards: \begin{aligned}J(\theta) &= \mathbb{E}\left[ \sum_{t\ge 0} \gamma^t r_t | \pi_\theta \right] \\&= \mathbb{E}_{\tau\sim p(\tau; \theta)} [r(\tau)] \\&= \int_\tau r(\tau) p(\tau; \theta) d\tau\end{aligned} and \begin{aligned}\nabla_\theta J(\theta) &= \int_\tau r(\tau) \nabla_\theta p(\tau; \theta) d\tau \\&= \int_\tau \big(r(\tau) \nabla_\theta \log p(\tau; \theta)\big) p(\tau; \theta) d\tau \\&= \mathbb{E}_{\tau\sim p(\tau; \theta)} [r(\tau) \nabla_\theta \log p(\tau; \theta)]\end{aligned} plug in $$p(\tau ; \theta)=\prod_{t \geq 0} p\left(s_{t+1} | s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} | s_{t}\right)$$ to get $\nabla_\theta J(\theta) \approx \sum_{t\ge 0} r(\tau) \nabla_\theta \log \pi_\theta (a_t | s_t)$ ### Variance Reduction The basic assumption of policy gradients is that if a trajectory is good then all of its actions were good. In expectation, it averages out. But it also suffers from high variance because credit assignment is really hard. • Push up the probability of an action seen, only by the cumulative future reward from that state $\nabla_\theta J(\theta) \approx \sum_{t\ge 0} \left( \sum_{t'\ge t} r_{t'} \right) \nabla_\theta \log \pi_\theta (a_t | s_t)$ • Use discount factor $$\gamma$$ to ignore delayed effects $\nabla_\theta J(\theta) \approx \sum_{t\ge 0} \left( \sum_{t'\ge t} \gamma^{t' - t} r_{t'} \right) \nabla_\theta \log \pi_\theta (a_t | s_t)$ • Introduce a baseline function dependent on the state $\nabla_\theta J(\theta) \approx \sum_{t\ge 0} \left( \sum_{t'\ge t} \gamma^{t' - t} r_{t'} - b(s_t) \right) \nabla_\theta \log \pi_\theta (a_t | s_t)$ • Choosing the value function as the baseline gives the advantage $\nabla_\theta J(\theta) \approx \sum_{t\ge 0} \big( Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t) \big) \nabla_\theta \log \pi_\theta (a_t | s_t)$ # Naive Bayes Assumption: independence between the features. It simplifies the classification task dramatically and works well in document classification and spam filtering. 
Given training data $$X = (X_1, X_2, \dots, X_n)$$, the probability of $$X$$ belonging to class $$C_k$$ is given by \begin{aligned}P\left(C_{k} | X_{1}, \ldots, X_{\mathrm{n}}\right) &= \frac{P\left(X_{1}, \ldots, X_{\mathrm{n}} | C_{k}\right) P\left(C_{k}\right)}{P\left(X_{1}, \ldots, X_{\mathrm{n}}\right)} \\&=\frac{P\left(X_{1} | C_{k}\right) \ldots P\left(X_{\mathrm{n}} | C_{k}\right) P\left(C_{k}\right)}{P\left(X_{1}, \ldots,X_{\mathrm{n}}\right)}\end{aligned} Ignore the normalizing term and use the Maximum A Posteriori (MAP) classification rule to get the class number $\hat{y} = \mathop{\arg\max}_k p(C_k) \prod_{i=1}^n p(x_i | C_k)$ ## Distribution For discrete values, the Bayes approach is intuitive. For continuous values, we can either • Use binning to discretize the feature values to obtain a new set of Bernoulli-distributed features. • Or assume they have a Gaussian distribution. To avoid an unseen feature making $$p(\mathbf{x} | C_k) = 0$$ and wiping out all information in the other probabilities, use the Laplacian correction: add 1 to each count. # Support Vector Machine Alternative view of logistic regression: $\min _{\theta} C \sum_{i=1}^{m}\left[y^{(i)} \operatorname{cost}_{1}\left(\theta^{T} x^{(i)}\right)+\left(1-y^{(i)}\right) \operatorname{cost}_{0}\left(\theta^{T} x^{(i)}\right)\right]+\frac{1}{2} \sum_{j=1}^{n} \theta_{j}^{2}$ An intuitive choice for the cost function is the Hinge loss. The resulting maximum-margin problem can be represented as $\begin{array}{cl}{\operatorname{minimize}} & {(1 / 2)\|a\|_{2}} \\ {\text { subject to }} & {a^{T} x_{i}+b \geq 1, \quad i=1, \ldots, N} \\ {} & {a^{T} y_{i}+b \leq-1, \quad i=1, \ldots, M}\end{array}$ ## Kernels Using a kernel allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. Given $$x$$, compute new features $$f_i$$ depending on proximity to landmarks $$l^{(1)}, \dots, l^{(n)}$$: $f_i = \operatorname{similarity}(x, l^{(i)})$ Then predict $$y = 1$$ if $$\theta^T f \ge 0$$. 
==To make valid kernels, the similarity function needs to satisfy Mercer's condition.== Gaussian kernel: $f_{i}=\exp \left(-\frac{\left\|x-l^{(i)}\right\|^{2}}{2 \sigma^{2}}\right)$ where $$l^{(i)} = x^{(i)}$$. ==Do perform feature scaling before using the Gaussian kernel.== ## Logistic Regression vs. SVM $$n=$$ number of features, $$m=$$ number of training examples. If $$n \gg m$$: Use logistic regression, or SVM without a kernel. If $$n$$ is small, $$m$$ is intermediate: Use SVM with Gaussian kernel. If $$n$$ is small, $$m$$ is large: Add more features, then use logistic regression or SVM without a kernel. # Clustering ## Hierarchical Clustering • Agglomerative (bottom up): Each point is a cluster at first; repeatedly combine the two "nearest" clusters into one. • Divisive (top down): Start with one cluster and recursively split it. To represent a cluster, for the Euclidean case, we can simply use the average of points as the centroid. For the non-Euclidean case, we can define the clustroid to be the point "closest" to the other points, where "closest" can be measured in different ways. To find the nearest clusters, we can use the distance from the centroid/clustroid, or other measures like the minimum distance between two points from each cluster, the diameter of the merged cluster, or the average distance between points in the cluster. Stop merging clusters when $$k$$ clusters are found (if we know the number of clusters), or when a threshold on the merging criterion is met, or when there is only 1 cluster left. ==The best choice depends on the shape of clusters.== ## $$k$$-Means Algorithm Assumes Euclidean space. • Randomly initialize $$K$$ cluster centroids $$\mu_1, \dots, \mu_K$$. • Find the index $$c^{(i)}$$ of the cluster centroid closest to point $$x^{(i)}$$. • Update cluster centroids by averaging points assigned to each cluster. • Repeat the assignment and update steps until convergence.
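The k-means steps above can be sketched in a few lines of NumPy. This is a minimal sketch: the `kmeans` helper, the two-blob toy data, and the fixed iteration count are illustrative, and real implementations test for convergence and handle empty clusters more carefully.

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # random init from the data
    for _ in range(iters):
        # assignment step: index c_i of the closest centroid for each point
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: each centroid becomes the mean of its assigned points
        mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                       for k in range(K)])
    return mu, c

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(10, 0.1, (20, 2))])
mu, c = kmeans(X, K=2)  # centroids land near (0, 0) and (10, 10)
```

Because the result depends on the random initialization, k-means is usually run several times and the run with the lowest objective is kept.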
Optimization objective: $J = \frac{1}{m} \sum_{i=1}^m \| x^{(i)} - \mu_{c^{(i)}} \|^2$ To choose the value of $$k$$: • Elbow method: try different $$k$$ and look at the changes in the average distance to the centroid. As $$k$$ increases, the average falls rapidly until the right $$k$$, then changes little. A well-defined elbow is rarely seen in practice. • Silhouette method: the silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. Weaknesses: • Sensitive to outliers and noise. Discover and eliminate them beforehand. • Can only handle numerical data. • Difficult to detect clusters with non-spherical shapes or widely different sizes/densities. # Decision Tree Decision trees can be used for classification or regression with a tree structure. • Select an attribute for the root node. • Split instances into subsets. • Repeat recursively for each branch, using only instances that reach the branch. ## Purity To decide which attribute to split on, use Information Gain or the Gini Index to measure the purity of the split. Entropy = $$-\sum_i p_i \log p_i$$ measures the disorder or uncertainty. ### Information Gain The difference of entropy after splitting. The higher, the better. $IG(S, A) = H(S) - \sum_{v\in A} \frac{|S_v|}{|S|} H(S_v)$ Use the Information Gain Ratio instead to prevent "super attributes" from being selected as the root. $IGR(A) = \frac{IG(A)}{IV(A)} = IG(A) \bigg/ -\sum_{v} \frac{|S_v|}{|S|} \log\left(\frac{|S_v|}{|S|}\right)$ ### Gini Index The smaller, the better. $G_i = 1 - \sum_{k=1}^n p_{ik}^2$ where $$p_{ik}$$ is the ratio of class $$k$$ instances among the training instances in the $$i$$-th node ## Pruning Change the model by deleting the child nodes of a branch node to prevent overfitting. • Pre-pruning: stop the growth early if a split would result in purity below a threshold. • Post-pruning: remove non-significant branches from a fully grown tree. Replace the subtree by a leaf node labeled with the most frequent class.
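The entropy and information-gain formulas above can be checked on a toy split (a minimal sketch; `entropy` and `information_gain` are illustrative helper names, and entropy is measured in bits):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    """IG = H(parent) minus the size-weighted average of the children's entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = np.array([0, 0, 1, 1])                    # H = 1 bit
pure_split = [np.array([0, 0]), np.array([1, 1])]  # both children pure
ig = information_gain(parent, pure_split)          # = 1.0, the maximum possible here
```

A perfectly separating split recovers the entire parent entropy, while a split that leaves the class mix unchanged has zero gain.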
==Post-pruning is more successful in practice because it is not easy to precisely estimate when to stop growing the tree.== # Ensemble Learning Multiple learners are trained and combined to solve the same problem. • Bagging (bootstrap aggregating): build several learners independently and then average their predictions. • Boosting: base learners are built sequentially, each trying to reduce the bias of the combined learner. ## Bagging • Sample with replacement from the training dataset. • Train base learners on each sample separately. • Average predictions from multiple base learners. • Majority voting for classification • Averaging for regression Often used with trees; random forest is an extension. ## Boosting Extremely useful in computer vision. • Train a base learner on the entire dataset. • Find the data that are incorrectly predicted and assign them more weight. • Train the next learner on the weighted dataset. • Repeat the process to sequentially train base learners. • Combine the base learners using a weighted average. Give more weight to those with better performance. AdaBoost computes the weight by $\alpha_t = \frac{1}{2} \ln \left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ where $$\epsilon_t$$ is the error rate. # PCA PCA (Principal Component Analysis) is the most popular algorithm in dimensionality reduction. It is widely used for 1. Data compression: Reduce memory. Speed up learning algorithms. 2. Data visualization. 3. Feature extraction. ==Before you implement PCA, first try running whatever you want to do with the original data/raw data.
Only if that doesn't do what you want, then implement PCA.== There are two points of view leading to the same result for PCA. One is to minimize the reconstruction error with the SVD: • Perform feature scaling/mean normalization to get $$X$$ • SVD: $$X = U \Sigma V^T$$ • The reduced PCA projections are given by $$Z = X V_{:, 1:k}$$ • Reconstruct data by $$\hat{X} = Z V_{:, 1:k}^T$$ Here the columns of $$V$$ are the principal directions and $$XV = U\Sigma$$ are the principal components. The other is to maximize the variance of the projected data. • Perform feature scaling/mean normalization to get $$X$$ • Compute the covariance matrix: $$S = X^T X$$ • Eigendecomposition: $$S = V \Sigma^2 V^T$$ • Same as SVD # Gradient Descent ## Variants • Batch gradient descent: Use all examples in each update. Not suitable for huge datasets. • Stochastic gradient descent: Use 1 example in each update. The randomness helps escape local optima, but it can never settle at the minimum. One solution is to gradually reduce the learning rate. • Mini-batch gradient descent: Take the best of both batch GD and SGD. ## Momentum GD can get trapped in local minima or saddle points. Momentum helps accelerate SGD in the relevant direction and dampens oscillations. \begin{aligned}v_t &= \beta v_{t-1} + \eta \nabla_\theta J(\theta) \\\theta_t &= \theta_{t-1} - v_t\end{aligned} • $$v$$ : plays the role of velocity. • $$\beta$$ : plays the role of friction. Must be between 0 and 1; a typical choice is about 0.9 ## Adam Adam (Adaptive Moment Estimation) automatically computes adaptive learning rates for each parameter.
• Compute gradient $$g_t$$ • Update biased 1st moment estimate $m_{t} = \beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_{t}$ • Update biased 2nd raw moment estimate $v_{t} = \beta_{2} v_{t-1}+\left(1-\beta_{2}\right) g_{t}^{2}$ • Compute bias-corrected 1st moment estimate $\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^t}$ • Compute bias-corrected 2nd raw moment estimate $\hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^t}$ • Update parameters $w_{t} =w_{t-1}-\eta \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}$ ==Default choice to use, especially for sparse data sets.== ## Second-Order • Quasi-Newton methods (BFGS most popular): Approximate the inverse Hessian with rank 1 updates over time. • L-BFGS: Does not form the full inverse Hessian. Usually works very well in full batch, deterministic mode. # Evaluation ## Validation Set Separate the dataset into training/validation/test sets; use the validation set for model selection and hyperparameter tuning. ### Hold-Out Validation For a relatively small dataset, split the data $$6:2:2$$. For a big dataset, split the data $$98:1:1$$. ### K-Fold Cross Validation For a small dataset, divide the training set into $$k$$ equal-size subsets. Each time, one of the $$k$$ subsets is used as the validation set. Repeat this process $$k$$ times and average the errors. ## Hyperparameter Searching • Grid search: Brute force to search every combination of hyperparameters. • Random search: Randomly sample and narrow the range. More effective in high-dimensional spaces. • Bayesian optimization: Build a probabilistic model of the validation score and use it to choose the next hyperparameters to evaluate. ## Over/Under Fitting Plot the learning curves for the training and validation sets for debugging. • High Bias: Underfit. Both training/validation error will be high. • Get additional features • More complex model • Better optimization algorithm such as Adam • Use ensemble learning — Boosting • Decrease $$\lambda$$ • High Variance: Overfit. Low training error, high validation error.
• Get more training examples • Try smaller sets of features • Use ensemble learning — Bagging & Random Forest • Increase $$\lambda$$ ## Metrics ### Classification The confusion matrix: • Accuracy: $\frac{\text{TP + TN}}{\text{TP + FN + FP + TN}}$ • Precision: $\frac{\text{TP}}{\text{TP + FP}}$ • Recall: $\frac{\text{TP}}{\text{TP + FN}}$ How to choose the trade-off between precision and recall depends on the actual problem. Single metric: • F score: $F_1 = 2\frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}$ • AUC (area under the curve): the ROC curve plots True Positive Rate TP / (TP + FN) vs. False Positive Rate FP / (FP + TN) at different classification thresholds. ### Regression • MAE (Mean Absolute Error): $$\displaystyle \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$ • MSE (Mean Squared Error): $$\displaystyle \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$ • RMSE (Root Mean Squared Error): $$\displaystyle \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$ • MAPE (Mean Absolute Percentage Error): $$\displaystyle \frac{1}{n} \sum_{i=1}^n \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$ • $$R^2$$ (coefficient of determination): $$\displaystyle \frac{\left( \sum_i (x_i - \bar{x}) (y_i - \bar{y})\right)^2}{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (\hat{y}_i - y_i)^2}$$ • Adjusted $$R^2$$ (prevents overfitting): $$\displaystyle R _{adj}^2 = 1-\frac{(1-R^2) (n-1)}{n-k-1}$$ where $$n$$ is the sample size and $$k$$ the total number of explanatory variables (not including the constant term). ]]> <p>Machine Learning notes.</p> <ul> <li>Coursera Andrew Ng <a href="https://www.coursera.org/learn/machine-learning/home/welcome">Machine Learning</a>. My <a href="https://github.com/silencial/Machine-Learning">solution</a></li> <li>Stanford CS231n <a href="http://cs231n.stanford.edu/index.html">Convolutional Neural Networks for Visual Recognition</a>.
My <a href="https://github.com/silencial/CNN-for-Visual-Recognition">solution</a></li> </ul> Convex Optimization https://silencial.github.io/convex-optimization/ 2019-07-05T00:00:00.000Z 2019-07-05T00:00:00.000Z AA/EE/ME 578 Review # Convex Sets ## Examples ### Subspaces $$S \subseteq \mathbf{R}^{n}$$ is a subspace if $x, y \in S, \quad \lambda, \mu \in \mathbf{R} \quad \Longrightarrow \lambda x+\mu y \in S$ Geometrically: $$x, y \in S \Rightarrow$$ plane through $$0, x, y \subseteq S$$. ### Affine Sets $$S \subseteq \mathbf{R}^n$$ is affine if $x, y \in S, \quad \lambda, \mu \in \mathbf{R}, \quad \lambda+\mu=1 \Longrightarrow \lambda x+\mu y \in S$ Geometrically: $$x, y \in S \Rightarrow$$ line through $$x, y \subseteq S$$. ### Convex Sets $$S \subseteq \mathbf{R}^n$$ is a convex set if $x, y \in S, \quad \lambda, \mu \geq 0, \quad \lambda+\mu=1 \Longrightarrow \lambda x+\mu y \in S$ Geometrically: $$x, y \in S \Rightarrow$$ segment $$[x, y] \subseteq S$$. ### Convex Cone $$S \subseteq \mathbf{R}^n$$ is a cone if $x \in S, \quad \lambda \geq 0 \Longrightarrow \lambda x \in S$ $$S \subseteq \mathbf{R}^n$$ is a convex cone if $x, y \in S, \quad \lambda, \mu \geq 0 \Longrightarrow \lambda x+\mu y \in S$ Geometrically: $$x, y \in S \Rightarrow$$ 'pie slice' between $$x, y \subseteq S$$. ### Combinations and Hulls $$y=\theta_{1} x_{1}+\cdots+\theta_{k} x_{k}$$ is a • linear combination of $$x_1, \dots, x_k$$ • affine combination if $$\sum \theta_i = 1$$ • convex combination if $$\sum \theta_i = 1,\ \theta_i \ge 0$$ • conic combination if $$\theta_i \ge 0$$. (linear, ...) hull of $$S$$ is the set of all (linear, ...) combinations from $$S$$. $\operatorname{conv}(S)=\bigcap\{G \mid S \subseteq G, \ G \text { convex }\}$ convex hull $$\operatorname{conv}(S)$$ is the set of all convex combinations of points in $$S$$. 
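Checking membership in $$\operatorname{conv}(S)$$ for a finite point set is itself a small optimization problem: $$p \in \operatorname{conv}(S)$$ iff some convex combination of the points equals $$p$$. A minimal sketch as a linear feasibility problem (the `in_convex_hull` helper and the toy square are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(p, points):
    """Is p = sum_i theta_i x_i for some theta >= 0 with sum_i theta_i = 1?"""
    n = len(points)
    # equality rows: points^T theta = p and 1^T theta = 1
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(p, 1.0)
    # zero objective: we only care whether a feasible theta exists
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# the center is a convex combination of the corners; (2, 2) is not
```

Dropping the $$\theta_i \ge 0$$ bound would instead test membership in the affine hull, and dropping the $$\sum\theta_i = 1$$ row the conic hull, mirroring the definitions above.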
### Hyperplanes and Halfspaces • Hyperplane: $$\left\{x \mid a^{T} x=b\right\}(a \neq 0)$$ • Halfspace: $$\left\{x \mid a^{T} x \leq b\right\}(a \neq 0)$$ $$a$$ is the normal vector. Hyperplanes are affine and convex; halfspaces are convex. ### Euclidean Balls and Ellipsoids Euclidean ball with center $$x_c$$ and radius $$r$$: $B\left(x_{c}, r\right)=\left\{x \mid \left\|x-x_{c}\right\|_{2} \leq r\right\}=\left\{x_{c}+r u \mid \|u\|_{2} \leq 1\right\}$ Ellipsoid: $\left\{x \mid \left(x-x_{c}\right)^{T} P^{-1}\left(x-x_{c}\right) \leq 1\right\}$ with $$P \in \mathbf{S}_{++}^{n}$$. Other representation: $$\left\{x_{c}+A u \mid \|u\|_{2} \leq 1\right\}$$ with $$A$$ square and nonsingular. ### Norm Balls and Norm Cones Norm: a function $$\|\cdot\|$$ that satisfies • $$\|x\| \geq 0 ; \ \|x\|=0$$ if and only if $$x=0$$ • $$\|t x\|=|t|\|x\|$$ for $$t \in \mathbf{R}$$ • $$\|x+y\| \leq\|x\|+\|y\|$$ Norm ball with center $$x_c$$ and radius $$r$$: $$\left\{x \mid \left\|x-x_{c}\right\| \leq r\right\}$$ Norm cone: $$\{(x, t) \mid \|x\| \leq t\}$$ ### Polyhedron Solution set of finitely many linear inequalities and equalities $A x \preceq b, \qquad C x=d$ ($$A \in \mathbf{R}^{m \times n}, \ C \in \mathbf{R}^{p \times n}, \ \preceq$$ is component-wise inequality) Polyhedron is intersection of finite number of halfspaces and hyperplanes. ### Positive Semidefinite Cone • $$\mathbf{S}_{+}^{n}=\left\{X \in \mathbf{S}^{n} \mid X \succeq 0\right\}$$: PSD matrices. A convex cone. • $$\mathbf{S}_{++}^{n}=\left\{X \in \mathbf{S}^{n} \mid X\succ 0\right\}$$: PD matrices. ## Operations That Preserve Convexity To show $$C$$ is convex set 1. Definition $x_{1}, x_{2} \in C, \quad 0 \leq \theta \leq 1 \quad \Longrightarrow \quad \theta x_{1}+(1-\theta) x_{2} \in C$ 2. 
Show that $$C$$ is obtained from simple convex sets by operations that preserve convexity • intersection • affine function • perspective function • linear-fractional functions ### Intersection The intersection of (any number of, even infinitely many) convex sets is convex. ### Affine Function Suppose $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}^{m}$$ is affine ($$f(x)=A x+b$$ with $$A \in \mathbf{R}^{m \times n}, \ b \in \mathbf{R}^{m}$$) • The image of a convex set under $$f$$ is convex $S \subseteq \mathbf{R}^{n} \text { convex } \Longrightarrow f(S)=\{f(x) \mid x \in S\} \text { convex }$ • The inverse image $$f^{-1}(C)$$ of a convex set under $$f$$ is convex $C \subseteq \mathbf{R}^{m} \text{ convex } \quad \Longrightarrow \quad f^{-1}(C)=\left\{x \in \mathbf{R}^{n} \mid f(x) \in C\right\} \text{ convex }$ ### Perspective and Linear-Fractional Function Perspective function $$P : \mathbf{R}^{n+1} \rightarrow \mathbf{R}^{n}$$ $P(x, t)=x / t, \quad \operatorname{dom} P=\{(x, t) \mid t>0\}$ images and inverse images of convex sets under perspective are convex. Linear-fractional function $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}^{m}$$ $f(x)=\frac{A x+b}{c^{T} x+d}, \qquad \text { dom } f=\left\{x \mid c^{T} x+d>0\right\}$ images and inverse images of convex sets under linear-fractional functions are convex.
## Generalized Inequalities A convex cone $$K \subseteq \mathbf{R}^{n}$$ is a proper cone if • $$K$$ is closed (contains its boundary) • $$K$$ is solid (has nonempty interior) • $$K$$ is pointed (contains no line) Generalized inequality defined by a proper cone $$K$$: $x \preceq_{K} y \ \Longleftrightarrow \ y-x \in K, \qquad x\prec_{K} y \ \Longleftrightarrow \ y-x \in \operatorname{int} K$ ## Hyperplane Theorem ### Separating Hyperplane If $$C$$ and $$D$$ are disjoint convex sets, then there exists $$a \ne 0, \ b$$ such that $a^{T} x \leq b \ \text { for }\ x \in C, \qquad a^{T} x \geq b \ \text { for } \ x \in D$ the hyperplane $$\left\{x \mid a^{T} x=b\right\}$$ separates $$C$$ and $$D$$. Strict separation requires additional assumptions (e.g. $$C$$ is closed, $$D$$ is a singleton). ### Supporting Hyperplane Supporting hyperplane to set $$C$$ at boundary point $$x_0$$: $\left\{x \mid a^{T} x=a^{T} x_{0}\right\}$ where $$a \ne 0$$ and $$a^T x \le a^T x_0$$ for all $$x \in C$$. Supporting hyperplane theorem: If $$C$$ is convex, then there exists a supporting hyperplane at every boundary point of $$C$$.
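For two finite point sets, a strictly separating hyperplane can be computed directly: after rescaling, the conditions $$a^T x_i + b \ge 1$$ and $$a^T y_j + b \le -1$$ form a linear feasibility problem. A minimal sketch with `scipy.optimize.linprog` (the `separating_hyperplane` helper and the toy point sets are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(X, Y):
    """Find (a, b) with a.x + b >= 1 on X and a.y + b <= -1 on Y, if one exists."""
    d = X.shape[1]
    # stack both families of inequalities as A_ub @ [a; b] <= -1
    A_ub = np.vstack([np.hstack([-X, -np.ones((len(X), 1))]),
                      np.hstack([Y, np.ones((len(Y), 1))])])
    b_ub = -np.ones(len(X) + len(Y))
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return (res.x[:d], res.x[d]) if res.success else None

X = np.array([[2.0, 2.0], [3.0, 2.0]])      # two small disjoint point clouds
Y = np.array([[-2.0, -2.0], [-3.0, -1.0]])
a, b = separating_hyperplane(X, Y)          # disjoint, so a separator exists
```

When the LP is infeasible the convex hulls of the two sets intersect, so no such hyperplane exists, consistent with the theorem above.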
## Dual Cones Dual cone of a cone $$K$$: $$K^*=\left\{y \mid y^{T} x \geq 0 \text { for all } x \in K\right\}$$ Examples: • $$K=\mathbf{R}_{+}^{n} : K^{*}=\mathbf{R}_{+}^{n}$$ • $$K=\mathbf{S}_{+}^{n} : K^{*}=\mathbf{S}_{+}^{n}$$ • $$K=\left\{(x, t) \mid\|x\|_{2} \leq t\right\} : K^{*}=\left\{(x, t) \mid\|x\|_{2} \leq t\right\}$$ • $$K=\left\{(x, t) \mid\|x\|_{1} \leq t\right\} : K^{*}=\left\{(x, t) \mid\|x\|_{\infty} \leq t\right\}$$ # Convex Functions $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$ is convex if $$\operatorname{dom} f$$ is a convex set and $f(\theta x+(1-\theta) y) \leq \theta f(x)+(1-\theta) f(y)$ for all $$x, y \in \operatorname{dom} f, \ 0\le \theta \le 1$$ ## Examples ### Convex • affine: $$a^T x + b$$ • exponential: $$e^{ax}$$, for any $$a \in \mathbf{R}$$ • powers: $$x^\alpha$$ on $$\mathbf{R}_{++}$$, for $$\alpha \ge 1$$ or $$\alpha \le 0$$ • powers of absolute value: $$|x|^p$$ on $$\mathbf{R}$$, for $$p \ge 1$$ • negative entropy: $$x\log x$$ on $$\mathbf{R}_{++}$$ • norms: $$\|x\|_{p}=\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}$$ for $$p \ge 1$$ • affine on matrices: $$f(X)=\operatorname{tr}\left(A^{T} X\right)+b$$ • spectral norm: $$f(X) = \|X\|_2 = \sigma_{\max}(X)$$ • quadratic: $$f(x)=(1 / 2) x^{T} P x+q^{T} x+r$$ with $$P \in \mathbf{S}_{+}^{n}$$ • least-squares: $$f(x) = \|Ax - b\|_2^2$$ • quadratic-over-linear: $$f(x, y) = x^2/y$$ with $$y > 0$$ • log-sum-exp: $$f(x)=\log \sum_{k=1}^{n} \exp x_{k}$$ ### Concave • $$f$$ is concave if $$-f$$ is convex • affine • powers: $$x^\alpha$$ on $$\mathbf{R}_{++}$$, for $$0\le \alpha \le 1$$ • logarithm: $$\log x$$ on $$\mathbf{R}_{++}$$ • $$\log\det X$$ on $$\mathbf{S}_{++}^n$$ • geometric mean: $$f(x)=\left(\prod_{k=1}^{n} x_{k}\right)^{1 / n}$$ on $$\mathbf{R}_{++}^n$$ ## Properties ### Restriction of a Convex Function to a Line $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$ is convex iff the function $$g : \mathbf{R} \rightarrow \mathbf{R}$$ $g(t)=f(x+t v), \qquad \operatorname{dom}
g=\{t \mid x+t v \in \operatorname{dom} f\}$ is convex (in $$t$$) for any $$x \in \operatorname{dom} f, \ v\in \mathbf{R}^n$$ ### First-Order Convexity Condition Differentiable $$f$$ with convex domain is convex iff $f(y) \geq f(x)+\nabla f(x)^{T}(y-x) \quad \text { for all } x, y \in \operatorname{dom} f$ ### Second-Order Convexity Condition Twice differentiable $$f$$ with convex domain is convex iff $\nabla^{2} f(x) \succeq 0 \quad \text { for all } x \in \operatorname{dom} f$ ### Epigraph and Sublevel Set $$\alpha$$-sublevel set of $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$: $C_{\alpha}=\{x \in \operatorname{dom} f \mid f(x) \leq \alpha\}$ sublevel sets of convex functions are convex (converse is false) Epigraph of $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$: $\operatorname{epi}f=\left\{(x, t) \in \mathbf{R}^{n+1} \mid x \in \operatorname{dom} f, \ f(x) \leq t\right\}$ $$f$$ is convex iff $$\operatorname{epi}f$$ is a convex set. ### Jensen's Inequality If $$f$$ is convex, then for $$0 \le \theta \le 1$$ $f(\theta x+(1-\theta) y) \leq \theta f(x)+(1-\theta) f(y)$ can be extended to $f(\mathbf{E}\ z) \leq \mathbf{E}\ f(z)$ for any random variable $$z$$ ## Operations That Preserve Convexity To show $$f$$ is convex function: 1. Verify definition (often simplified by restricting to a line) 2. For twice differentiable functions, show $$\nabla^2 f(x) \succeq 0$$ 3. Show that $$f$$ is obtained from simple convex functions by operations that preserve convexity • Nonnegative multiple: $$\alpha f$$ is convex if $$f$$ is convex, $$\alpha \ge 0$$ • Sum: $$f_1 + f_2$$ is convex if $$f_1, f_2$$ convex • Composition with affine function: $$f(Ax + b)$$ is convex if $$f$$ is convex • Pointwise maximum: if $$f_1, \dots, f_m$$ are convex, then $$f(x)=\max \left\{f_{1}(x), \ldots, f_{m}(x)\right\}$$ is convex • Pointwise supremum: if $$f(x, y)$$ is convex in $$x$$ for each $$y\in \mathcal{A}$$, then $$g(x)=\sup _{y \in \mathcal{A}} f(x, y)$$ is convex. 
• Composition of $$g : \mathbf{R}^{n} \rightarrow \mathbf{R}$$ and $$h : \mathbf{R} \rightarrow \mathbf{R}$$: $$f(x) = h(g(x))$$ is convex if • $$g$$ convex, $$h$$ convex, $$\tilde{h}$$ nondecreasing • $$g$$ concave, $$h$$ convex, $$\tilde{h}$$ nonincreasing • Composition of $$g : \mathbf{R}^{n} \rightarrow \mathbf{R}^k$$ and $$h : \mathbf{R}^k \rightarrow \mathbf{R}$$: $$f(x) = h(g(x)) = h(g_1(x), \dots, g_k(x))$$ is convex if • $$g_i$$ convex, $$h$$ convex, $$\tilde{h}$$ nondecreasing in each argument • $$g_i$$ concave, $$h$$ convex, $$\tilde{h}$$ nonincreasing in each argument • Minimization: if $$f(x, y)$$ is convex in $$(x, y)$$ and $$C$$ is a convex set, then $$g(x)=\inf _{y \in C} f(x, y)$$ is convex • Perspective: the perspective of a function $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$ is the function $$g : \mathbf{R}^{n} \times \mathbf{R} \rightarrow \mathbf{R}$$ $g(x, t)=t f(x / t), \qquad \operatorname{dom} g=\{(x, t) \mid x / t \in \operatorname{dom} f, t>0\}$ $$g$$ is convex if $$f$$ is convex ## Conjugate Function The conjugate of a function $$f$$: $$f^{*}(y)=\sup _{x \in \operatorname{dom} f}\left(y^{T} x-f(x)\right)$$ is always convex. ## Quasiconvex Function $$f : \mathbf{R}^{n} \rightarrow \mathbf{R}$$ is quasiconvex if $$\operatorname{dom} f$$ is convex and the sublevel sets $S_{\alpha}=\{x \in \operatorname{dom} f \mid f(x) \leq \alpha\}$ are convex for all $$\alpha$$ $$f$$ is quasiconcave if $$-f$$ is quasiconvex.
• Modified Jensen inequality: $$0 \leq \theta \leq 1 \ \Longrightarrow \ f(\theta x+(1-\theta) y) \leq \max \{f(x), f(y)\}$$ • First-order condition: $$f(y) \leq f(x) \quad \Longrightarrow \quad \nabla f(x)^{T}(y-x) \leq 0$$ • Sums of quasiconvex functions are not necessarily quasiconvex ## Log-Concave and Log-Convex Function A positive function $$f$$ is log-concave if $$\log f$$ is concave: $f(\theta x+(1-\theta) y) \geq f(x)^{\theta} f(y)^{1-\theta} \quad \text { for }\ 0 \leq \theta \leq 1$ Many common probability densities are log-concave, e.g., the normal distribution. • Second-order condition: $$f(x) \nabla^{2} f(x) \preceq \nabla f(x) \nabla f(x)^{T}$$ • Product of log-concave functions is log-concave • Sum of log-concave functions is not always log-concave • Integration: if $$f : \mathbf{R}^{n} \times \mathbf{R}^{m} \rightarrow \mathbf{R}$$ is log-concave, then $$\displaystyle g(x)=\int f(x, y) dy$$ is log-concave # Convex Optimization Problems ## Optimization Problem $\begin{array}{cl}\operatorname{minimize} & {f_{0}(x)} \\\text { subject to } & {f_{i}(x) \leq 0, \quad i=1, \ldots, m} \\& {h_{i}(x)=0, \quad i=1, \ldots, p}\end{array}$ • $$x \in \mathbf{R}^n$$ is the optimization variable • $$f_0: \mathbf{R}^n \rightarrow \mathbf{R}$$ is the objective or cost function • $$f_i: \mathbf{R}^n \rightarrow \mathbf{R}, i = 1, \dots, m$$ are the inequality constraint functions • $$h_i: \mathbf{R}^n \rightarrow \mathbf{R}$$ are the equality constraint functions Optimal value $$p^{\star}=\inf \left\{f_{0}(x) \mid f_{i}(x) \leq 0,\ i=1, \ldots, m,\ h_{i}(x)=0, \ i=1, \ldots, p\right\}$$ • $$p^\star = \infty$$ if problem is infeasible • $$p^\star = -\infty$$ if problem is unbounded below ## Feasibility Problem $\begin{array}{cl}{\operatorname{minimize}} & {0} \\ {\text { subject to }} & {f_{i}(x) \leq 0, \quad i=1, \ldots, m} \\ {} & {h_{i}(x)=0, \quad i=1, \ldots, p}\end{array}$ • $$p^\star = 0$$ if constraints are feasible; any feasible $$x$$ is optimal • $$p^\star = 
\infty$$ if constraints are infeasible ## Convex Optimization $\begin{array}{ll}{\operatorname{minimize}} & {f_{0}(x)} \\ {\text { subject to }} & {f_{i}(x) \leq 0, \quad i=1, \ldots, m} \\ {} & Ax = b\end{array}$ • $$f_0, \dots, f_m$$ are convex; equality constraints are affine • Feasible set of a convex optimization problem is convex • Any locally optimal point of a convex problem is globally optimal ### Optimality Criterion $$x$$ is optimal iff it is feasible and $$\nabla f_{0}(x)^{T}(y-x) \geq 0$$ for all feasible $$y$$. If nonzero, $$\nabla f_0(x)$$ defines a supporting hyperplane to the feasible set $$X$$ at $$x$$. • unconstrained problem: $x \in \operatorname{dom} f_{0}, \quad \nabla f_{0}(x)=0$ • equality constrained problem: $\operatorname{minimize}\ f_{0}(x) \quad \operatorname{subject to} \ A x=b$ $$x$$ is optimal iff there exists a $$\nu$$ such that $x \in \operatorname{dom} f_{0}, \quad A x=b, \quad \nabla f_{0}(x)+A^{T} \nu=0$ • minimization over nonnegative orthant: $\operatorname{minimize}\ f_{0}(x) \quad \operatorname{subject to} \ x \succeq 0$ $$x$$ is optimal iff $x \in \operatorname{dom} f_{0}, \qquad x \succeq 0, \qquad\left\{\begin{array}{ll}{\nabla f_{0}(x)_{i} \geq 0} & {x_{i}=0} \\ {\nabla f_{0}(x)_{i}=0} & {x_{i}>0}\end{array}\right.$ ### Equivalent Convex Problems • eliminating equality constraints $\begin{array}{ll}{\operatorname{minimize}\ (\text {over } z)} & {f_{0}\left(F z+x_{0}\right)} \\ {\text {subject to }} & {f_{i}\left(F z+x_{0}\right) \leq 0, \quad i=1, \ldots, m}\end{array}$ where $$F$$ and $$x_0$$ are such that $$A x=b \ \Longleftrightarrow \ x=F z+x_{0}$$ for some $$z$$ • introducing slack variables for linear inequalities $\begin{array}{ll}{\operatorname{minimize}} & {f_{0}(x)} \\ {\text { subject to }} & {a_{i}^{T} x \leq b_{i}, \quad i=1, \ldots, m}\end{array}$ is equivalent to $\begin{array}{ll}{\operatorname{minimize}\ (\text{over } x, s)} & {f_{0}(x)} \\ {\text {subject to }} & {a_{i}^{T} x+s_{i}=b_{i}, \quad i=1, \ldots, m} \\ {} & 
{s_{i} \geq 0, \quad i=1, \ldots m}\end{array}$ • epigraph form: standard form convex problem is equivalent to $\begin{array}{ll}{\operatorname{minimize}\ (\text {over } x, t)} & {t} \\ {\text {subject to }} & {f_{0}(x)-t \leq 0} \\ {} & {f_{i}(x) \leq 0, \quad i=1, \ldots, m} \\ {} & {A x=b}\end{array}$ ## Quasiconvex Optimization $\begin{array}{cl}{\operatorname{minimize}} & {f_{0}(x)} \\ {\text { subject to }} & {f_{i}(x) \leq 0, \quad i=1, \ldots, m} \\ {} & {A x=b}\end{array}$ with $$f_0: \mathbf{R}^n \rightarrow \mathbf{R}$$ quasiconvex, $$f_1, \dots, f_m$$ convex. Can have locally optimal points that are not globally optimal If $$f_0$$ is quasiconvex, there exists a family of functions $$\phi_t$$ such that: • $$\phi_t(x)$$ is convex in $$x$$ for fixed $$t$$ • $$t$$-sublevel set of $$f_0$$ is $$0$$-sublevel set of $$\phi_t$$ For a fixed $$t$$, the quasiconvex optimization problem can be transferred to a convex feasibility problem in $$x$$. Bisection method can be used to find the optimal $$t$$. ## Linear Optimization ### Linear Programming $\begin{array}{cl}{\text { minimize }} & {c^{T} x+d} \\ {\text { subject to }} & {G x \preceq h} \\ {} & {A x=b}\end{array}$ • feasible set is a polyhedron. 
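A tiny numeric instance of the LP above, solved with `scipy.optimize.linprog` (the data $$c, G, h$$ is made up for illustration; `linprog` minimizes $$c^T x$$ subject to $$Gx \preceq h$$ and the variable bounds):

```python
import numpy as np
from scipy.optimize import linprog

# minimize  -x1 - x2
# subject to  x1 + 2 x2 <= 4,  3 x1 + x2 <= 6,  x >= 0
c = np.array([-1.0, -1.0])
G = np.array([[1.0, 2.0], [3.0, 1.0]])
h = np.array([4.0, 6.0])
res = linprog(c, A_ub=G, b_ub=h, bounds=(0, None))
# the optimum sits at the vertex x = (1.6, 1.2) where both inequalities are tight
```

Since the feasible set is a polyhedron and the objective is linear, an optimal point can always be found at a vertex, which is what the solver returns here.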
### Linear-Fractional Programming $\begin{array}{cl}{\text { minimize }} & {f_{0}(x)} \\ {\text { subject to }} & {G x \preceq h} \\ {} & {A x=b}\end{array}$ where $f_{0}(x)=\frac{c^{T} x+d}{e^{T} x+f}, \qquad \operatorname{dom} f_{0}(x)=\left\{x \mid e^{T} x+f>0\right\}$ • a quasiconvex optimization problem • equivalent to the LP $\begin{array}{cl}{\text { minimize }} & {c^{T} y+d z} \\ {\text { subject to }} & {G y \preceq h z} \\ {} & {A y=b z} \\ {} & {e^{T} y+f z=1} \\ {} & {z \geq 0}\end{array}$ ## Quadratic Optimization ### Quadratic Programming $\begin{array}{cl}{\operatorname{minimize}} & {(1 / 2) x^{T} P x+q^{T} x+r} \\ {\text { subject to }} & {G x \preceq h} \\ {} & {A x=b}\end{array}$ • $$P \in \mathbf{S}_+^n$$, so objective is convex quadratic • minimize a convex quadratic function over a polyhedron ### Quadratically Constrained Quadratic Programming $\begin{array}{ll}{\operatorname{minimize}} & {(1 / 2) x^{T} P_{0} x+q_{0}^{T} x+r_{0}} \\ {\text { subject to }} & {(1 / 2) x^{T} P_{i} x+q_{i}^{T} x+r_{i} \leq 0, \quad i=1, \ldots, m} \\ {} & {A x=b}\end{array}$ • $$P_i \in \mathbf{S}_+^n$$, objective and constraints are convex quadratic • if $$P_1, \dots, P_m \in \mathbf{S}_{++}^n$$, feasible region is intersection of $$m$$ ellipsoids and an affine set ### Second-Order Cone Programming $\begin{array}{cl}{\operatorname{minimize}} & {f^{T} x} \\ {\text { subject to }} & {\left\|A_{i} x+b_{i}\right\|_{2} \leq c_{i}^{T} x+d_{i}, \quad i=1, \ldots, m} \\ {} & {F x=g}\end{array}$ • $$A_{i} \in \mathbf{R}^{n_{i} \times n}, F \in \mathbf{R}^{p \times n}$$ • inequalities are called second-order cone constraints • for $$n_i = 0$$, reduces to an LP; if $$c_i = 0$$, reduces to a QCQP ## Geometric Programming • Monomial function $f(x)=c x_{1}^{a_{1}} x_{2}^{a_{2}} \cdots x_{n}^{a_{n}}, \qquad \text { dom } f=\mathbf{R}_{++}^{n}$ with $$c>0,\ a_i \in \mathbf{R}$$ • Posynomial function: sum of monomials $f(x)=\sum_{k=1}^{K} c_{k} x_{1}^{a_{1 k}} x_{2}^{a_{2 k}} 
\cdots x_{n}^{a_{n k}}, \qquad \operatorname{dom} f=\mathbf{R}_{++}^{n}$ • Geometric program $\begin{array}{ll}{\text { minimize }} & {f_{0}(x)} \\ {\text { subject to }} & {f_{i}(x) \leq 1, \quad i=1, \ldots, m} \\ {} & {h_{i}(x)=1, \quad i=1, \ldots, p}\end{array}$ with $$f_i$$ posynomial, $$h_i$$ monomial Geometric program in convex form: $\begin{array}{cl}{\operatorname{minimize}} & {\log \left(\sum_{k=1}^{K} \exp \left(a_{0 k}^{T} y+b_{0 k}\right)\right)} \\ {\text { subject to }} & {\log \left(\sum_{k=1}^{K} \exp \left(a_{i k}^{T} y+b_{i k}\right)\right) \leq 0, \quad i=1, \ldots, m} \\ {} & {G y+d=0}\end{array}$ ## Generalized Inequality Constraints $\begin{array}{ll}{\operatorname{minimize}} & {f_{0}(x)} \\ {\text { subject to }} & {f_{i}(x) \preceq_{K_{i}} 0, \quad i=1, \ldots, m} \\ {} & {A x=b}\end{array}$ • $$f_0: \mathbf{R}^n \rightarrow \mathbf{R}$$ convex; $$f_i: \mathbf{R}^n \rightarrow \mathbf{R}^{k_i}$$ $$K_i$$-convex w.r.t. proper cone $$K_i$$ • same properties as standard convex problem (convex feasible set, local optimum is global, etc.) Conic form problem: $\begin{array}{cl}\operatorname{minimize} & c^{T} x \\\text{subject to} & F x+g \preceq_{K} 0 \\& A x=b\end{array}$ extends LP ($$K = \mathbf{R}_+^m$$) to non-polyhedral cones Semidefinite programming: $\begin{array}{cl}{\operatorname{minimize}} & {c^{T} x} \\{\text { subject to }} & {x_{1} F_{1}+x_{2} F_{2}+\cdots+x_{n} F_{n}+G \preceq 0} \\{} & {A x=b}\end{array}$ with $$F_i, G \in \mathbf{S}^k$$ ### Minimum and Minimal Elements $$\preceq_K$$ is not in general a linear ordering: we can have $$x \npreceq_K y$$ and $$y \npreceq_K x$$ • $$x \in S$$ is the minimum element of $$S$$ w.r.t. $$\preceq_K$$ if $$y \in S \ \Longrightarrow \ x \preceq_K y$$ • $$x \in S$$ is the minimal element of $$S$$ w.r.t. $$\preceq_K$$ if $$y \in S, \ y \preceq_{K} x \ \Longrightarrow \ y=x$$ ## Vector Optimization General vector optimization problem: $\begin{array}{ll}\text{minimize (w.r.t. 
} K ) & f_{0}(x) \\\text{subject to} & f_{i}(x) \leq 0, \quad i=1, \ldots, m \\& h_{i}(x) = 0, \quad i=1, \ldots, p\end{array}$ vector objective $$f_{0} : \mathbf{R}^{n} \rightarrow \mathbf{R}^{q}$$, minimized w.r.t. proper cone $$K \subseteq \mathbf{R}^{q}$$ Convex vector optimization problem: $\begin{array}{ll}\text{minimize (w.r.t. } K ) & f_{0}(x) \\\text{subject to} & f_{i}(x) \leq 0, \quad i=1, \ldots, m \\& A x=b\end{array}$ with $$f_0$$ $$K$$-convex, $$f_1,\dots, f_m$$ convex ### Optimal and Pareto Optimal Points Set of achievable objective values $$\mathcal{O}=\left\{f_{0}(x) \mid x\ \text { feasible}\right\}$$ • feasible $$x$$ is optimal if $$f_0(x)$$ is the minimum value of $$\mathcal{O}$$ • feasible $$x$$ is Pareto optimal if $$f_0(x)$$ is a minimal value of $$\mathcal{O}$$ # Duality ## Lagrange Dual Function From the standard form optimization problem we define the Lagrangian $$L : \mathbf{R}^{n} \times \mathbf{R}^{m} \times \mathbf{R}^{p} \rightarrow \mathbf{R}$$ $L(x, \lambda, \nu)=f_{0}(x)+\sum_{i=1}^{m} \lambda_{i} f_{i}(x)+\sum_{i=1}^{p} \nu_{i} h_{i}(x)$ • weighted sum of objective and constraint functions • $$\lambda_i$$ is the Lagrange multiplier associated with $$f_i(x) \le 0$$ • $$\nu_i$$ is the Lagrange multiplier associated with $$h_i(x) = 0$$ Lagrange dual function $$g : \mathbf{R}^{m} \times \mathbf{R}^{p} \rightarrow \mathbf{R}$$ \begin{aligned}g(\lambda, \nu) &=\inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \\&=\inf_{x \in \mathcal{D}}\left(f_{0}(x)+\sum_{i=1}^{m} \lambda_{i} f_{i}(x)+\sum_{i=1}^{p} \nu_{i} h_{i}(x)\right)\end{aligned} $$g$$ is concave.
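A worked instance of the dual function: for the problem minimize $$x^2$$ subject to $$1 - x \le 0$$ (so $$p^\star = 1$$ at $$x^\star = 1$$), the infimum of $$L(x, \lambda) = x^2 + \lambda(1 - x)$$ over $$x$$ is attained at $$x = \lambda/2$$, giving $$g(\lambda) = \lambda - \lambda^2/4$$. A quick numeric check (the sampling grid is illustrative):

```python
import numpy as np

# primal: minimize x^2 subject to 1 - x <= 0; p* = 1 at x = 1
# Lagrangian L(x, lam) = x^2 + lam*(1 - x); inf over x is at x = lam/2
def g(lam):
    return lam - lam ** 2 / 4.0  # dual function, concave in lam

p_star = 1.0
lams = np.linspace(0.0, 5.0, 501)
# g(lam) <= p* for every lam >= 0, with equality at lam = 2, so d* = p* here
# (strong duality; Slater's condition holds since x = 2 is strictly feasible)
```

Maximizing this concave $$g$$ over $$\lambda \ge 0$$ recovers the primal optimal value exactly, which is the best outcome the dual problem can deliver.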
lower bound property: if $$\lambda \succeq 0$$, then $$g(\lambda, \nu) \leq p^{\star}$$ ## Lagrange Dual Problem $\begin{array}{ll}{\text { maximize }} & {g(\lambda, \nu)} \\ {\text { subject to }} & {\lambda \succeq 0}\end{array}$ • find the best lower bound on $$p^\star$$ obtainable from the Lagrange dual function • a convex optimization problem; optimal value denoted $$d^\star$$ • $$\lambda, \nu$$ are dual feasible if $$\lambda \succeq 0, (\lambda, \nu) \in \operatorname{dom} g$$ • often simplified by making the implicit constraint $$(\lambda, \nu) \in \operatorname{dom} g$$ explicit ## Optimality Conditions • Weak duality $$d^\star \le p^\star$$ always holds and can be expressed as $\sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda) \leq \inf_x \sup_{\lambda \succeq 0} L(x, \lambda)$ • Strong duality $$d^\star = p^\star$$ usually holds for convex problems. It means that $$x^\star$$ and $$\lambda^\star$$ form a saddle point of the Lagrangian. ### Slater's Constraint Qualification Strong duality holds for a convex problem if it is strictly feasible, i.e., $\exists x \in \operatorname{int} \mathcal{D} : \qquad f_{i}(x)<0, \quad i=1, \ldots, m, \qquad A x=b$ ### KKT Conditions If strong duality holds and $$x^\star, \lambda^\star, \nu^\star$$ are optimal, then they must satisfy: 1. primal constraints: $$f_{i}(x^\star) \leq 0,\ h_{i}(x^\star)=0$$ 2. dual constraints: $$\lambda^\star \succeq 0$$ 3. complementary slackness: $$\lambda_i^\star f_i(x^\star) = 0$$ 4. gradient of the Lagrangian with respect to $$x$$ vanishes: $\nabla f_{0}(x^\star)+\sum_{i=1}^{m} \lambda_{i}^\star \nabla f_{i}(x^\star)+\sum_{i=1}^{p} \nu_{i}^\star \nabla h_{i}(x^\star)=0$ Conversely, if $$\tilde{x}, \tilde{\lambda}, \tilde{\nu}$$ satisfy the KKT conditions for a convex problem, then they are optimal. # Applications ## Geometric Problems ### Minimum Volume Ellipsoid Around a Set Minimum volume ellipsoid $$\mathcal{E}$$ containing a set $$C$$.
• parametrize $$\mathcal{E}$$ as $$\mathcal{E}=\left\{v \mid\|A v+b\|_{2} \leq 1\right\}$$, assume $$A \in \mathbf{S}_{++}^n$$ • $$\operatorname{vol} \mathcal{E}$$ is proportional to $$\det A^{-1}$$, can compute $$\mathcal{E}$$ by solving $\begin{array}{ll}{\operatorname{minimize}}\ (\text {over } A, b) & {\log \operatorname{det} A^{-1}} \\ {\text {subject to }} & {\sup _{v \in C}\|A v+b\|_{2} \leq 1}\end{array}$ ### Maximum Volume Inscribed Ellipsoid Maximum volume ellipsoid $$\mathcal{E}$$ inside a convex set $$C \subseteq \mathbf{R}^n$$ • parametrize $$\mathcal{E}$$ as $$\mathcal{E}=\left\{B u+d \mid\|u\|_{2} \leq 1\right\}$$, assume $$B \in \mathbf{S}_{++}^n$$ • $$\operatorname{vol} \mathcal{E}$$ is proportional to $$\det B$$, can compute $$\mathcal{E}$$ by solving $\begin{array}{ll}{\text { maximize }} & {\log \operatorname{det} B} \\ {\text { subject to }} & {\sup _{\|u\|_{2} \leq 1} I_{C}(B u+d) \leq 0}\end{array}$ where $$I_C(x) = 0$$ for $$x\in C$$ and $$I_C(x) = \infty$$ for $$x\notin C$$ ### Linear Discrimination Separate two sets of points $$\{x_1, \dots, x_N\},\ \{y_1, \dots, y_M\}$$ by a hyperplane: $a^{T} x_{i}+b>0, \ i=1, \ldots, N, \qquad a^{T} y_{i}+b<0, \ i=1, \ldots, M$ homogeneous in $$a, b$$, hence equivalent to $a^{T} x_{i}+b \geq 1, \quad i=1, \ldots, N, \qquad a^{T} y_{i}+b \leq-1, \quad i=1, \ldots, M$ To separate two sets of points by maximum margin $\begin{array}{cl}{\operatorname{minimize}} & {(1 / 2)\|a\|_{2}} \\ {\text { subject to }} & {a^{T} x_{i}+b \geq 1, \quad i=1, \ldots, N} \\ {} & {a^{T} y_{i}+b \leq-1, \quad i=1, \ldots, M}\end{array}$ ### Support Vector Classifier $\begin{array}{cl}{\text { minimize }} & {\|a\|_{2}+\gamma\left(\mathbf{1}^{T} u+\mathbf{1}^{T} v\right)} \\ {\text { subject to }} & {a^{T} x_{i}+b \geq 1-u_{i}, \quad i=1, \ldots, N} \\ {} & {a^{T} y_{i}+b \leq-1+v_{i}, \quad i=1, \ldots, M} \\ {} & {u \succeq 0, \quad v \succeq 0}\end{array}$ produces point on trade-off curve between inverse of margin 
$$2/\|a\|_2$$ and classification error, measured by total slack $$\mathbf{1}^{T} u+\mathbf{1}^{T} v$$ ## Data Fitting ### Norm Approximation $\operatorname{minimize} \|A x-b\|$ where $$A \in \mathbf{R}^{m \times n}$$ with $$m\ge n$$ Linear measurement model: $$y = Ax + v$$, $$y$$ are measurements, $$x$$ is unknown, $$v$$ is measurement error. Given $$y=b$$, best guess of $$x$$ is $$x^\star$$ ### Least-Norm Problems $\begin{array}{ll}{\text { minimize }} & {\|x\|} \\ {\text { subject to }} & {A x=b}\end{array}$ where $$A \in \mathbf{R}^{m \times n}$$ with $$m\le n$$ ### Scalarized Problem $\operatorname{minimize} \|A x-b\|+\gamma\|x\|$ tradeoff between error and norm ## Statistical Estimation ### Maximum Likelihood Estimation $\operatorname{maximize}\ (\text{over } x ) \quad \log p_{x}(y)$ With linear measurement model with IID noise: $$y_{i}=a_{i}^{T} x+v_{i}, \ i=1, \ldots, m$$, the estimation problem becomes $\operatorname{maximize}\ l(x) = \sum_{i=1}^{m} \log p\left(y_{i}-a_{i}^{T} x\right)$ where $$y$$ is observed value, $$p$$ is the PDF of the measurement noise $$v$$ • Gaussian noise $$\mathcal{N}\left(0, \sigma^{2}\right) : p(z)=\left(2 \pi \sigma^{2}\right)^{-1 / 2} e^{-z^{2} /\left(2 \sigma^{2}\right)}$$ $l(x)=-\frac{m}{2} \log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{m}\left(a_{i}^{T} x-y_{i}\right)^{2}$ • Laplacian noise $$p(z)=(1 /(2 a)) e^{-|z| / a}$$ $l(x)=-m \log (2 a)-\frac{1}{a} \sum_{i=1}^{m}\left|a_{i}^{T} x-y_{i}\right|$ • Uniform noise on $$[-a, a]$$ $l(x)=\left\{\begin{array}{ll}{-m \log (2 a)} & {\left|a_{i}^{T} x-y_{i}\right| \leq a, \quad i=1, \ldots, m} \\ {-\infty} & {\text { otherwise }}\end{array}\right.$ ### Logistic Regression Random variable $$y\in \{0,1\}$$ with distribution $p=\operatorname{prob}(y=1)=\frac{\exp \left(a^{T} u+b\right)}{1+\exp \left(a^{T} u+b\right)}$ log-likelihood function (for $$y_1 = \cdots = y_k = 1, \ y_{k+1} = \cdots = y_m = 0$$): \begin{aligned} l(a, b) &=\log \left(\prod_{i=1}^{k} 
\frac{\exp \left(a^{T} u_{i}+b\right)}{1+\exp \left(a^{T} u_{i}+b\right)} \prod_{i=k+1}^{m} \frac{1}{1+\exp \left(a^{T} u_{i}+b\right)}\right) \\ &=\sum_{i=1}^{k}\left(a^{T} u_{i}+b\right)-\sum_{i=1}^{m} \log \left(1+\exp \left(a^{T} u_{i}+b\right)\right) \end{aligned} concave in $$a, b$$ ]]> <p>AA/EE/ME 578 Review</p> Machine Learning for Big Data https://silencial.github.io/machine-learning-for-big-data/ 2019-06-16T00:00:00.000Z 2019-06-16T00:00:00.000Z CSE 547 Review # Frequent Itemset Mining ## Market-Basket Model We have a large set of items: things sold in a supermarket. A large set of baskets, each a small subset of the items: things one customer buys on one day. Our goal is to find association rules: people who bought $$I = \{i_1, \dots, i_k\}$$ tend to buy $$j$$, denoted by $$I \rightarrow j$$. ## Frequent Itemsets Support for itemset $$I$$: the number of baskets containing all items in $$I$$. Support threshold $$s$$: sets of items that appear in at least $$s$$ baskets are called frequent itemsets. Confidence of an association rule is the probability of $$j$$ given $$I = \{i_1, \dots, i_k\}$$: $\text{conf}(I \rightarrow j) = \frac{\text{support}(I \cup j)}{\text{support}(I)}$ Since not all high-confidence rules are interesting, we define the interest of an association rule $$I \rightarrow j$$: $\text{Interest}(I \rightarrow j) = | \text{conf}(I \rightarrow j) - \Pr[j] |$ ## Mining Association Rules 1. Find all frequent itemsets $$I$$. 2. For every subset $$A$$ of $$I$$, generate a rule $$A \rightarrow I\setminus A$$. 3. Compute the rule confidence. ==Finding frequent itemsets is the most challenging part.== ## Finding Frequent Itemsets ### A-Priori Algorithm 1. Read baskets and count the # of occurrences of each item. 2. Find frequent items that appear $$\ge s$$ times. 3. Read baskets again and keep track of only those pairs where both elements are frequent. ### PCY Algorithm In pass 1 of A-Priori, most memory is idle.
We can use this to reduce the memory required in pass 2. • In addition to item counts, maintain a hash table with as many buckets as fit in memory. • Keep a count for each bucket into which pairs of items are hashed. • For a bucket with total count $$\le s$$, none of its pairs can be frequent. A-Priori, PCY, etc. take $$k$$ passes to find frequent itemsets of size $$k$$. We can use 2 or fewer passes for all sizes, but may miss some frequent itemsets. ### Random Sampling • Take a random sample of the baskets and run a typical in-memory algorithm. • Verify that the candidate pairs are truly frequent in the entire data set with a second pass. • We can use a smaller threshold to catch more truly frequent itemsets. ### SON Algorithm An itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset. • Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets. • Count all the candidate itemsets and determine which are frequent in the entire set. ### Toivonen's Algorithm • Start with a random sample, but lower the threshold slightly for the sample. • Find frequent itemsets in the sample. • Add the negative border to these itemsets. An itemset is in the negative border if it is not frequent but all its immediate subsets are. Immediate subset = delete exactly one element. • If no itemset from the negative border turns out to be frequent, then we have found all the frequent itemsets. If we find one, we must start over with another sample. # Finding Similar Items Given high-dimensional data points $$x_1, x_2, \dots$$ and some distance function $$d(x_1, x_2)$$, our goal is to find all pairs of data points $$(x_i, x_j)$$ that satisfy $$d(x_i, x_j) \le s$$. ## Shingling Convert a document into a set representation. A k-shingle for a document is a sequence of k tokens that appears in the doc. Then we can transfer sets to a shingles-documents Boolean matrix.
The element in row $$e$$ and column $$s$$ is 1 iff document $$s$$ contains shingle $$e$$. Suppose document $$D_1$$ is represented by the set of its k-shingles $$C_1 = S(D_1)$$, then a natural similarity measure is the Jaccard similarity: $$\text{sim}(D_1, D_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$$. Jaccard distance: $$d(C_1, C_2) = 1 - \text{sim}(D_1, D_2)$$ ## Min-Hashing Convert large sets to short signatures, while preserving similarity. Find a hash function $$h$$ such that $$\Pr[h(C_1) = h(C_2)] = \text{sim}(C_1, C_2)$$. For the Jaccard similarity, this function is called Min-Hashing. • Permute the rows of the Boolean matrix with a permutation $$\pi$$. • Define the minhash function $$h_\pi (C) = \min_\pi \pi(C)$$ as the number of the first row in which column $$C$$ has value 1. • Repeat this process for different permutations to create a signature for each column. ## Locality-Sensitive Hashing • Divide the signature matrix $$M$$ into $$b$$ bands of $$r$$ rows. • For each band, hash its portion of each column to a hash table with $$k$$ buckets ($$k$$ is large enough). • Choose candidate pairs that hash to the same bucket for at least 1 band. Suppose $$\text{sim}(C_1, C_2) = t$$, then the probability that no band is identical is $$(1 - t^r)^b$$. We can get the S-curve by using LSH: ### Extension A family $$H$$ of hash functions is said to be $$(d_1, d_2, p_1, p_2)$$-sensitive if for any $$x$$ and $$y$$ in $$S$$: • if $$d(x, y) \le d_1$$, then for any $$h \in H$$, $$\Pr[h(x) = h(y)] \ge p_1$$. • if $$d(x, y) \ge d_2$$, then for any $$h \in H$$, $$\Pr[h(x) = h(y)] \le p_2$$. Rows/Bands techniques are just AND/OR operations: • AND: if $$H$$ is $$(d_1, d_2, p_1, p_2)$$-sensitive, then $$H'$$ is $$(d_1, d_2, p_1^r, p_2^r)$$-sensitive. • OR: if $$H$$ is $$(d_1, d_2, p_1, p_2)$$-sensitive, then $$H'$$ is $$(d_1, d_2, 1 - (1 - p_1)^b, 1 - (1 - p_2)^b)$$-sensitive. We can use any sequence of AND's and OR's to get the best S-curve.
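The banding probability above can be evaluated directly: a pair with similarity $$t$$ becomes a candidate with probability $$1 - (1 - t^r)^b$$. A small sketch with parameter values of my own choosing:

```python
# Candidate-pair probability for b bands of r rows each:
# P(t) = 1 - (1 - t**r)**b, where t is the pair's similarity.
def candidate_prob(t, r, b):
    return 1 - (1 - t**r) ** b

# Example (my own numbers): a 100-row signature split into b = 20 bands
# of r = 5 rows puts the steep part of the S-curve near
# t ~ (1/b)**(1/r) ~ 0.55.
assert candidate_prob(0.3, 5, 20) < 0.05   # dissimilar pairs: rarely candidates
assert candidate_prob(0.8, 5, 20) > 0.99   # similar pairs: almost always candidates
```

Tuning $$r$$ and $$b$$ moves the threshold and sharpens or flattens the S-curve, trading false positives against false negatives.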
### Other Distance Metrics For cosine distance, Random Hyperplanes method is a $$(d_1, d_2, (1 - d_1 / \pi), (1 - d_2 / \pi))$$-sensitive family for any $$d_1$$ and $$d_2$$. Random hyperplanes: Each vector $$v$$ determines a hash function $$h_v$$ such that $$h_v(x) = 1$$ if $$v\cdot x\ge 0$$, or $$= -1$$ if $$v\cdot x < 0$$. For Euclidean distance, Project on lines method is a $$(a/2, 2a, 1/2, 1/3)$$-sensitive family for any bucket width $$a$$. Project on line: Partition a random line into buckets with width $$a$$, then project the point onto the line. Use bucket id as the hash value. # Clustering Given a set of points and distance measure, group the points into clusters so that members of a cluster are close to each other. ## Hierarchical Clustering • Agglomerative (bottom up): Each point is a cluster at first and then repeatedly combine the two "nearest" clusters into one. • Divisive (top down): Start with one cluster and recursively split it. To represent a cluster, for Euclidean case, we can simply use the average of points as the centroid. For non-Euclidean case, we can define clustroid to be the point "closest" to other points, where the "closest" can be measured in different ways. To find nearest clusters, we can use the distance from the centroid/clustroid, or other measures like the minimum distance between two points from each cluster, the diameter of the merged cluster, or the average distance between points in the cluster. Stop merging clusters when $$k$$ clusters are found (if we know the number of clusters), or criterion is met based on the merging criterion, or there is only 1 cluster left. ==The best choice depends on the shape of clusters.== ## $$k$$-Means Clustering Assumes Euclidean space. • Initialize $$k$$ cluster by picking one point each. • Place other points to their nearest cluster. • Update the centroids of $$k$$ clusters. • Repeat until convergence. Try different $$k$$ and look at the changes in the average distance to centroid. 
As $$k$$ increases, the average falls rapidly until right $$k$$, then changes little. ## BFR Algorithm BFR is a variant of $$k$$-means to handle very large data sets. It keeps summary statistics of groups of points to save memory but assumes clusters are normally distributed in a Euclidean space. 1. Initialize $$k$$ clusters. 2. Load in a bag of points from disk 3. Assign new points to one of the clusters within some distance threshold. 4. Cluster the remaining points to create new clusters. 5. Merge clusters created in 4 with existing clusters. 6. Repeat 2-5 until all points are examined. We need to keep track of 3 sets of points: • Discard set (DS): Points close enough to a centroid to be summarized. • Compression set (CS): Groups of points that are close together but not close to any existing centroid. • Retained set (RS): Isolated points waiting to be assigned to a CS. The statistics we need to keep: • $$N$$: The number of points. • $$SUM$$: A vector whose $$i$$-th component is the sum of coordinates in the $$i$$-th dimension. • $$SUMSQ$$: A vector whose $$i$$-th component is the sum of squares of coordinates in $$i$$-th dimension. Closeness measurement: Mahalanobis distance: Normalized Euclidean distance from centroid. Combining criterion: The variance of the combined cluster is below some threshold. ## CURE Algorithm CURE (Clustering Using REpresentatives) only assumes a Euclidean distance and allows clusters to be any shape. • Pick a random sample of points that fit in main memory. • Cluster these points hierarchically. • For each cluster, pick a sample of points, as dispersed as possible. Moving them $$20\%$$ toward the centroid of the cluster to get the representatives. • Rescan the whole dataset and for every point $$p$$, find the closest representative and assign $$p$$ to its cluster. # Dimensionality Reduction ## SVD Pros: • Optimal low-rank approximation in terms of Frobenius norm. 
Cons: • A singular vector specifies a linear combination of all input columns or rows. • Lack of sparsity: singular vectors are dense. ## CUR Decomposition It is common for the matrix $$A$$ that we wish to decompose to be very sparse, but $$U$$ and $$V$$ from SVD decomposition are not. CUR decomposition solves this problem by using only randomly chosen rows and columns of $$A$$. $A \approx C\cdot U\cdot R$ Where $$C$$ and $$R$$ are random picking columns and rows, respectively. $$U$$ is the pseudo-inverse of the intersection of $$C$$ and $$R$$. To decrease the error $$\|A - C\cdot U\cdot R\|_F$$, rows and columns must be picked with the probabilities proportional to importance: the square of the Frobenius norm of a row/column. Pros: • Easy interpretation: Basis vectors are actual columns and rows. • Sparsity. Cons: • There will be duplicate rows and columns. # Recommendation System ## Content-Based Recommend items to customer $$x$$ similar to previous items rated highly by $$x$$. • Build user and item profiles. • Compute the cosine similarity between user and item. Pros: • No need for data on other users. • Able to recommend new & unpopular items. Cons: • Finding the appropriate features to create the profile is hard. • Cannot build profile for new users. • Never recommends items outside user's content profile. • Unable to exploit quality judgements of other users. ## Collaborative Filtering User-User CF: For user $$x$$, find similar users based on the rating and estimate $$x$$'s rating on item $$i$$ as the weighted average of these users' ratings on item $$i$$. Item-Item CF: similar. Similarity metric: Pearson correlation coefficient. 
$s_{xy} = \frac{\sum_{s \in S_{xy}} (r_{xs} - \overline{r_x}) (r_{ys} - \overline{r_y})}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \overline{r_x})^2} \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \overline{r_y})^2}}$ where $$s_{xy}$$ is the similarity between user $$x$$ and $$y$$, $$S_{xy}$$ is the set of items that are rated by both users $$x$$ and $$y$$. $$\overline{r_x}, \overline{r_y}$$ are the average rating of $$x, y$$. Notice that this is the cosine similarity when data is centered at 0. Pros: • No feature selection needed, works for any kind of item. Cons: • Cold Start problem. • The user/rating matrix is sparse and hard to find similar user/item. • Cannot recommend items that has not been previously rated. • Tends to recommend popular items. Cannot recommend items to someone with unique taste. ==In practice, item-item CF often works better since items are simpler, users have multiple tastes.== ## Latent Factor Model SVD gives minimum reconstruction error (Sum of Squared Errors). Since the rating matrix has missing entries, we need to apply SVD only on the data presented: $\min_{P, Q} \sum_{(i,x) \in R} (r_{xi} - q_i p_x)^2$ To prevent overfitting, we add the regularization term: $\min_{P, Q} \sum_{(i,x) \in R} (r_{xi} - q_i p_x)^2 + \lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2$ Taking bias into account: $\min_{P, Q} \sum_{(i,x) \in R} \big(r_{xi} - (\mu + b_x + b_i + q_i p_x)\big)^2 + \lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2 + \lambda_3 \sum_x \|b_x\|^2 + \lambda_4 \sum_i \|b_i\|^2$ Where $$\mu$$ is the overall mean rating, $$b_x$$ is the bias for user $$x$$, $$b_i$$ is the bias for movie $$i$$. Add time dependence to biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i p_x$ To solve this optimization problem, we can use SGD (Stochastic Gradient Descent) or ALS (Alternating Least Squares). # PageRank ## Random Surfer Start at a random page and follow random out-links repeatedly. 
Compute the PageRank $$r_j$$ as the probability of being at a page: $r_j = \sum_{i \rightarrow j} \frac{r_i}{d_i}$ where $$d_i$$ is the out-degree of node $$i$$. We can create the column stochastic matrix $$M$$ where $$M_{ji} = 1/d_i$$, then $$Mr = r$$. We can solve for $$r$$ by power iteration. ## Google Formulation Problems: • Dead ends: Some pages have no out-links and cause importance to leak out. • Spider traps: Out-links are within the group and will absorb all importance. Solution: follow a random out-link with probability $$\beta$$ and teleport with probability $$1-\beta$$: $r_j = \sum_{i \rightarrow j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$ # Community Detection in Graphs ## Approximate PPR Algorithm • Pick a seed node $$s$$ of interest. • Run PPR with teleport set $$=\{s\}$$. • Sort the nodes by decreasing PPR score. • For each $$i$$ compute $$\phi(A_i = \{r_1, \dots, r_i\})$$ and find the local minima of $$\phi(A_i)$$. Compute PPR (Personalized PageRank): \mathbf{\text{ApproxPageRank}}(S, \beta, \epsilon):\qquad\qquad\qquad \\\begin{aligned}&\text{Set } r=\vec{0}, q=[0,\dots, 1\dots, 0] \\&\text{While } \max_{u \in V} \frac{q_u}{d_u} \ge \epsilon: \\&\qquad \text{Choose any vertex } u \text{ where } \frac{q_u}{d_u} \ge \epsilon \\&\qquad \mathbf{\text{Push}}(u, r, q): \\&\qquad\qquad r' = r, q' = q \\&\qquad\qquad r'_u = r_u + (1 - \beta) q_u \\&\qquad\qquad q'_u = \frac{1}{2} \beta q_u \\&\qquad\qquad \text{For each } v \text{ such that } u \rightarrow v: \\&\qquad\qquad\qquad q'_v = q_v + \frac{1}{2}\beta q_u/d_u \\&\qquad\qquad r=r', q=q' \\&\quad \text{Return } r\end{aligned} The key to this algorithm is to keep track of the residual PageRank $$q$$ and push it to $$r$$ until $$q$$ is small. Good cluster definition: • Maximize the number of within-cluster connections. • Minimize the number of between-cluster connections.
Graph Cut is the set of edges (edge weights) with only one node in the cluster: $\operatorname{cut}(A) = \sum_{i\in A, j\notin A} w_{ij}$ Conductance is the connectivity of the group to the rest of the network relative to the density of the group: $\phi(A)=\frac{|\{(i, j) \in E ; i \in A, j \notin A\}|}{\min (\operatorname{vol}(A), 2 m-\operatorname{vol}(A))}$ where $$\operatorname{vol}(A)$$ is the total weight of the edges with at least one endpoint in $$A$$: $$\operatorname{vol}(A) = \sum_{i \in A} d_i$$. ## Modularity Maximization Modularity $$Q$$ is a measure of how well a network is partitioned into communities. Given a partitioning of the network into groups $$s \in S$$: $Q \propto \sum_{s \in S}[(\# \text{ edges within group } s) - (\text{expected }\# \text{ edges within group } s) ]$ Given a graph $$G$$ on $$n$$ nodes and $$m$$ edges, the expected number of edges between nodes $$i$$ and $$j$$ of degrees $$k_i$$ and $$k_j$$ equals $$\frac{k_i k_j}{2m}$$. $Q = \frac{1}{2m} \sum_{s\in S}\sum_{i\in s}\sum_{j\in s}\left(A_{ij} - \frac{k_i k_j}{2m}\right)$ $$Q$$ is in the range $$[-1, 1]$$. Values greater than $$0.3$$–$$0.7$$ indicate significant community structure. # Graph Representation Learning The modern deep learning toolbox is designed for simple sequences or grids, but graphs have complex topological structure, so we need to encode nodes into an embedding space where the similarity between nodes is preserved. • Define an encoder (a mapping from nodes to embeddings). • Define a node similarity function. • Optimize the parameters of the encoder so that $$\text{similarity}(u,v) \approx z_v^T z_u$$. ## Random Walk Embeddings The similarity between nodes $$u$$ and $$v$$ is defined as the probability that $$u$$ and $$v$$ co-occur on a random walk over the network. • Estimate the probability of visiting node $$v$$ on a random walk starting from node $$u$$ using some random walk strategy $$R$$: $$P_R(v|u)$$.
• Optimize embedding to encode these random walk statistics: $$\theta \propto P_R(v|u)$$. ## Unsupervised Feature Learning Learn node $$u$$'s embedding such that nearby nodes $$N_R(u)$$ are close together in the network. Given $$G=(V, E)$$, learn a mapping $$z: u\rightarrow \mathbb{R}^d$$ such that $\max_{\mathbf{z}} \sum_{u \in V} \log \mathrm{P}\left(N_{\mathrm{R}}(u) | z_{u}\right)$ With random walk optimization: $\log \mathrm{P}\left(N_{\mathrm{R}}(u) | z_{u}\right)=\sum_{v \in N_{R}(u)} \log \mathrm{P}\left(\mathrm{z}_{v} | z_{u}\right)$ With softmax parametrization: $\mathrm{P}\left(z_{v} | z_{u}\right)=\frac{\exp \left(z_{v} \cdot z_u\right)}{\sum_{n \in V} \exp \left(z_{n} \cdot z_u\right)}$ Putting it all together: $\mathcal{L}=\sum_{u \in V} \sum_{v \in N_{R}(u)}-\log \left(\frac{\exp \left(\mathbf{z}_{u}^{\top} \mathbf{z}_{v}\right)}{\sum_{n \in V} \exp \left(\mathbf{z}_{u}^{\top} \mathbf{z}_{n}\right)}\right)$ Optimizing random walk embeddings = Finding node embeddings $$z$$ that minimize $$\mathcal{L}$$. Negative sampling \begin{aligned}&\log \left(\frac{\exp \left(\mathbf{z}_{u}^{\top} \mathbf{z}_{v}\right)}{\sum_{n \in V} \exp \left(\mathbf{z}_{u}^{\top} \mathbf{z}_{n}\right)}\right) \\&\approx \log \left(\sigma\left(\mathbf{z}_{u}^{\top} \mathbf{z}_{v}\right)\right)-\sum_{i=1}^{k} \log \left(\sigma\left(\mathbf{z}_{u}^{\top} \mathbf{z}_{n_{i}}\right)\right), n_{i} \sim P_{V}\end{aligned} Instead of normalizing w.r.t. all nodes, just normalize against 𝑘 random "negative samples" $$n_i$$. ## Node2vec Algorithm A flexible, biased random walks that can trade off between local and global views of the network. • Return parameter $$p$$: Return back to the previous node. • In-out parameter $$q$$: Moving outwards (DFS) vs. inwards (BFS). # Large-Scale Machine Learning ## Decision Tree A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output. 
It splits the data at each internal node and each leaf node makes a prediction. To construct a tree, we must figure out how to split and when to stop splitting. How to split: • Regression: Purity. Find the split $$(X^{(i)}, v)$$ that partitions $$D$$ into $$D_L, D_R$$ and maximizes $|D| \operatorname{Var}(D)-\big(|D_L| \operatorname{Var}(D_L)+|D_R| \operatorname{Var}(D_R)\big)$ where $$Var(D) = \frac{1}{n}\sum_{i\in D}(y_i - \bar{y})^2$$. • Classification: Information Gain. $$IG(Y|X) = H(Y) - H(Y|X)$$, where $$H$$ is the entropy, $$H(X) = -\sum_{j=1}^m p(X_j) \log p(X_j)$$, and $$H(Y|X) = \sum_j P(X = v_j) H(Y |X=v_j)$$. $$IG$$ tells us how much information about $$Y$$ is contained in $$X$$, so higher $$IG$$ means a good split. When to stop: • When the leaf is "pure": the target variable does not vary too much, $$Var(y) < \epsilon$$. • When the # of examples in the leaf is too small. ## Support Vector Machine Given data $$(x_1, y_1), \dots, (x_n, y_n)$$ where $$y_i \in \{-1, +1 \}$$, we want to find a hyperplane $$w \cdot x + b = 0$$ that separates these data. For the $$i$$-th datapoint, $$\gamma_i = (w \cdot x_i + b) y_i$$ is the distance and we want to maximize the margin: $\max_{w, b} \min_{i} \gamma_i$ This can be rewritten as: \begin{aligned}&\max\ \gamma \\&s.t.\ \forall i, y_{i}(w \cdot x_{i}+b) \geq \gamma\end{aligned} Working with normalized $$w$$ and requiring the support vectors $$x_j$$ to lie on the planes $$w \cdot x_j + b = \pm 1$$, we get \begin{aligned}&\min\ \frac{1}{2} \|w\|^2 \\&s.t.\ \forall i, y_{i}(w \cdot x_{i}+b) \geq 1\end{aligned} Introduce a penalty for data that is not separable: \begin{aligned}&\min\ \frac{1}{2} \|w\|^2 + C\sum_{i=1}^n \xi_i \\&s.t.\ \forall i, y_{i}(w \cdot x_{i}+b) \geq 1 - \xi_i\end{aligned} where $$\xi_i$$ are the slack variables and $$C$$ is the slack penalty. When $$C=\infty$$, it strictly separates the data; when $$C=0$$, it ignores the data. The natural form of SVM is $\mathop{\arg\min}_{w,b} \frac{1}{2} w\cdot w + C \sum_{i=1}^n \max\{0, 1-y_i(w \cdot x_i + b)\}$ Use SGD to solve this problem.
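The hinge-loss SGD step can be sketched directly; the toy data set, step size, and epoch count below are my own illustrative choices, not from the course.

```python
import numpy as np

# Minimal SGD sketch for the unconstrained SVM objective
#   (1/2) w.w + C * sum_i max(0, 1 - y_i (w.x_i + b)).
rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian clusters (toy data, made up here).
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

w, b = np.zeros(2), 0.0
C, eta = 1.0, 0.01
for epoch in range(200):
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b) < 1:
            # Hinge term is active: per-sample subgradient is w - C*y_i*x_i.
            w -= eta * (w - C * y[i] * X[i])
            b += eta * C * y[i]
        else:
            w -= eta * w  # only the regularizer contributes

accuracy = np.mean(np.sign(X @ w + b) == y)
```

On separable data like this, the learned hyperplane should classify essentially all points correctly.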
# Mining Data Streams In many data mining situations, we do not know the entire data set in advance. We can think of the data as infinite and non-stationary. SGD is an example of a stream algorithm. In machine learning it is called online learning. • Allows for modeling problems where we have a continuous stream of data. • Slowly adapts to changes in the data. Types of queries: • Random sampling from a stream • Queries over sliding windows • Filtering a data stream • Counting distinct elements • Estimating moments • Finding frequent elements ## Sampling Sampling a fixed proportion. To get a sample of an $$a/b$$ fraction of the stream: • Hash each tuple's key uniformly into $$b$$ buckets • Pick the tuple if it is hashed to the first $$a$$ buckets Sampling a fixed-size sample. Property: for all time steps $$n$$, each of the elements seen so far has equal probability of being sampled. Reservoir sampling: • Store all of the first $$s$$ elements of the stream in $$S$$ • Suppose we have seen $$n-1$$ elements, and now the $$n$$-th element arrives ($$n > s$$) • With probability $$s/n$$, uniformly pick a random element in $$S$$ and replace it by the $$n$$-th element; otherwise discard it ## Queries Over a Sliding Window Queries are about a window of length $$N$$ — the $$N$$ most recent elements received. Given a stream of $$0$$s and $$1$$s, how many $$1$$s are in the last $$N$$ bits? An approximate answer is OK since we cannot afford to store $$N$$ bits. If the stream is uniformly distributed, we can simply count the total number of $$0$$s: $$Z$$ and $$1$$s: $$S$$ and get the result: $$N \frac{S}{S+Z}$$. If it is non-uniform, the DGIM method gives an answer with accuracy higher than $$50\%$$. Summarize blocks with specific numbers of $$1$$s and let the block size increase exponentially. When a new bit comes in, drop the oldest bucket if its end-time is prior to $$N$$ time units before the current time. If the current bit is $$0$$, then no other changes are needed; if it is $$1$$, create a new bucket of size $$1$$.
If there are now three buckets of size $$1$$, combine the oldest two into a bucket of size $$2$$, and continue to combine. When querying, sum the sizes of all buckets but the oldest, then add half the size of the oldest bucket. This method can be extended to count the sum of the last $$k$$ elements of a stream of positive integers. Instead of maintaining $$1$$ or $$2$$ buckets of each size, we can maintain $$r-1$$ or $$r$$. The error is at most $$\mathcal{O}(1/r)$$. By picking $$r$$, we can trade off between memory and error. ## Filtering Given a list of keys $$S$$, determine which tuples of the stream are in $$S$$. Hash table: • Create a bit array $$B$$ of $$n$$ bits, initially all $$0$$s. • Choose a hash function $$h$$ with range $$[0, n)$$. • For every $$s \in S$$, set $$B[h(s)] = 1$$. • Hash each element $$a$$ of the stream and output $$a$$ iff $$B[h(a)]=1$$. Suppose $$|S| = m$$, then the probability of a false positive equals the fraction of $$1$$s in the array $$B$$: $1-\left(1 - \frac{1}{n}\right)^m \approx 1- e^{-m/n}$ A Bloom filter uses $$k$$ different hash functions and outputs $$a$$ iff it hashes to $$1$$ for every hash function. Now the false positive rate is $$(1 - e^{-km/n})^k$$. The optimal value of $$k$$ is $$(n/m) \ln 2$$. ## Counting Distinct Elements Flajolet-Martin approach: • Pick a hash function $$h$$ that maps each of the $$N$$ elements to at least $$\log_2 N$$ bits. • For each stream element $$a$$, let $$r(a)$$ be the number of trailing $$0$$s in $$h(a)$$. • Let $$R = \max_a r(a)$$, and estimate the number of distinct elements to be $$2^R$$. But $$E[2^R]$$ is actually infinite: each time $$R \rightarrow R+1$$ the probability halves but the value doubles. The workaround is to use many hash functions $$h_i$$ and get many samples. Partition the samples into small groups, take the median of each group, and then take the average of the medians.
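The Flajolet-Martin procedure with the median-of-groups combination can be sketched as follows; the salted SHA-256 hashes stand in for the family of hash functions $$h_i$$, and the stream and group sizes are illustrative choices of mine.

```python
import hashlib
import statistics

def trailing_zeros(h, width=32):
    # Number of trailing 0 bits in h (width if h == 0).
    if h == 0:
        return width
    c = 0
    while h & 1 == 0:
        h >>= 1
        c += 1
    return c

def fm_estimate(stream, salt):
    # One Flajolet-Martin estimate 2**R from a single (salted) hash function.
    r_max = 0
    for x in stream:
        h = int.from_bytes(hashlib.sha256(f"{salt}:{x}".encode()).digest()[:4], "big")
        r_max = max(r_max, trailing_zeros(h))
    return 2 ** r_max

stream = [i % 1000 for i in range(5000)]   # exactly 1000 distinct values
estimates = [fm_estimate(stream, salt) for salt in range(60)]
# Combine as in the notes: median within each small group, then the
# average of the medians.
groups = [estimates[i:i + 10] for i in range(0, 60, 10)]
estimate = sum(statistics.median(g) for g in groups) / len(groups)
```

Each individual estimate is a power of two, so the group medians tame the heavy tail while the final average smooths the power-of-two granularity; the result typically lands within a small factor of the true count.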
## Computing Moments Let $$m_i$$ be the number of times value $$i$$ occurs in the stream; the $$k$$-th moment is $\sum_{i \in A}\left(m_{i}\right)^{k}$ The AMS method gives an unbiased estimate for all moments. Take the $$2$$-nd moment for example: • Pick some random time $$t$$, at which the stream has item $$X$$. • Maintain a count $$c$$ of the number of $$X$$s in the stream starting from the chosen time $$t$$. • The estimate of the $$2$$-nd moment is $$S = f(X) = n(2c - 1)$$. • Since we have $$(X_1, X_2, \dots, X_k)$$, the estimate will be $$S = \sum f(X_i) / k$$. For estimating the $$k$$-th moment we use the same algorithm but change $$f(X)$$. ]]> <p><a href="https://courses.cs.washington.edu/courses/cse547/19sp/">CSE 547</a> Review</p> Regular Expression https://silencial.github.io/regex/ 2019-05-24T00:00:00.000Z 2021-03-21T00:00:00.000Z Regular Expression Tutorial Regex 101: an online site for testing regular expressions # Matching • ^: matches the start of the input string • $: matches the end of the input string
• \b: matches a word boundary. For example, er\b matches the "er" in "never" but not the "er" in "verb"
• \B: matches a non-word boundary, the opposite of \b
• \n: matches a newline
• \r: matches a carriage return
• \t: matches a tab
• \s: matches any whitespace character, including spaces, tabs, form feeds, etc.
• \S: matches any non-whitespace character
• .: matches any single character except \r and \n
• \d: matches a digit character
• \D: matches a non-digit character
• \w: matches any letter, digit, or underscore
• \W: the opposite of \w
• [\u4e00-\u9fa5]: matches Chinese characters
• [^\x00-\xff]: matches double-byte characters (including Chinese characters)

# Repetition

• *: matches the preceding expression zero or more times
• +: matches the preceding expression one or more times
• ?: matches the preceding expression zero or one time
• {n}: matches the preceding expression exactly n times
• {n,}: matches the preceding expression at least n times
• {n,m}: matches the preceding expression at least n and at most m times

# Special

• \: escape character; marks the next character as special
• ?: non-greedy quantifier. When it follows any repetition modifier, the match consumes as few characters as possible
• (pattern): matches and captures the string, for back-references. By default, groups are numbered 1, 2, 3, ... from left to right; \1 then refers to the text matched by group 1.
• (?:pattern): non-capturing group; the parentheses only group for precedence
• (?<name>pattern): matches and names the captured group name
• (?=pattern): matches the position before pattern (lookahead)
• (?!pattern): matches a position not followed by pattern (negative lookahead)
• (?<=pattern): matches the position after pattern (lookbehind)
• (?<!pattern): matches a position not preceded by pattern (negative lookbehind)
• |: alternation (or)
• [xyz]: matches any one of the enclosed characters
• [^xyz]: matches any character not listed
• [a-z]: matches any character in the given range
• [^a-z]: matches any character not in the given range
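A few of these constructs demonstrated with Java's Pattern and Matcher (the sample patterns and inputs are our own choices):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupDemo {
    public static void main(String[] args) {
        // (pattern) captures; \1 back-references group 1.
        System.out.println("abab".matches("(ab)\\1")); // true

        // (?<name>pattern) names a group.
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher("2021-05");
        m.find();
        System.out.println(m.group("year")); // 2021

        // (?=pattern) lookahead: match "foo" only when followed by "bar".
        System.out.println(Pattern.compile("foo(?=bar)").matcher("foobar").find()); // true
        System.out.println(Pattern.compile("foo(?=bar)").matcher("foobaz").find()); // false

        // A lazy quantifier (.+?) matches as little as possible.
        Matcher lazy = Pattern.compile("<.+?>").matcher("<a><b>");
        lazy.find();
        System.out.println(lazy.group()); // <a>
    }
}
```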
<p>Regular Expression Tutorial</p>
Algorithm I https://silencial.github.io/algorithms-1/ 2019-02-10T00:00:00.000Z 2020-07-10T00:00:00.000Z Review of Princeton Algorithms I course on Coursera.

Algorithms II Review

My solution to the homework Course Overview

| topic | data structures and algorithms |
| --- | --- |
| data types | stack, queue, bag, union-find, priority queue |
| sorting | quicksort, mergesort, heapsort |
| searching | BST, red-black BST, hash table |
| graphs | BFS, DFS, Prim, Kruskal, Dijkstra |
| strings | radix sorts, tries, KMP, regexps, data compression |
| advanced | B-tree, suffix array, maxflow |

# Union-Find

Given a set of $$N$$ objects, we want to implement the following two functions.

• Union: connect two objects.
• Find: is there a path connecting the two objects?

| | initialize | union | connected |
| --- | --- | --- | --- |
| Quick-find | $$N$$ | $$N$$ | $$1$$ |
| Quick-union | $$N$$ | $$N$$ | $$N$$ |
| Weighted QU | $$N$$ | $$\lg N$$ | $$\lg N$$ |

## Quick-Find

1. Initialization: integer array id[] of length N
2. Union(p, q): change all entries whose id equals id[p] to id[q]
3. Find: check if p and q have the same id.
• Union too expensive ($$N$$ array accesses).
• Trees are flat, but too expensive to keep them flat.

## Quick-Union

1. Initialization: integer array id[] of length N. id[i] is parent of i
2. Union(p, q): change the root id of p to the root id of q
3. Find: check if p and q have the same root.
• Find too expensive (could be $$N$$ array accesses).
• Trees can get tall.

## Weighted Quick-Union

1. Initialization: an additional array sz[] counts the number of objects in the tree rooted at i
2. Union(p, q): change the root of smaller tree to root of larger tree and update sz[]
3. Find: same as quick-union
4. ==Path compression: we can add an extra line of code to flatten the tree while computing the root of a node==
• Find takes time proportional to depth of $$p$$ and $$q$$.
• Union takes constant time, given roots.
• Depth of any node is at most $$\lg N$$.
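Steps 1–4 above can be sketched as follows; this mirrors, but is not identical to, the course's WeightedQuickUnionUF class:

```java
public class WeightedQuickUnionUF {
    private final int[] parent;  // parent[i] is the parent of i
    private final int[] size;    // size[i] = objects in tree rooted at i

    public WeightedQuickUnionUF(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    private int root(int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];  // path compression: point to grandparent
            i = parent[i];
        }
        return i;
    }

    public boolean connected(int p, int q) { return root(p) == root(q); }

    public void union(int p, int q) {
        int rp = root(p), rq = root(q);
        if (rp == rq) return;
        // link the root of the smaller tree to the root of the larger one
        if (size[rp] < size[rq]) { parent[rp] = rq; size[rq] += size[rp]; }
        else                     { parent[rq] = rp; size[rp] += size[rq]; }
    }

    public static void main(String[] args) {
        WeightedQuickUnionUF uf = new WeightedQuickUnionUF(10);
        uf.union(1, 2); uf.union(2, 3);
        System.out.println(uf.connected(1, 3)); // true
        System.out.println(uf.connected(1, 4)); // false
    }
}
```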

## Applications

• Pixels in a digital photo.
• Computers in a network.
• Friends in a social network.
• Transistors in a computer chip.
• Elements in a mathematical set.
• Variable names in Fortran program.
• Metallic sites in a composite system.

## HW1 Percolation

Specification

• $$N$$-by-$$N$$ grid of sites.
• Each site is open with probability $$p$$.
• System percolates iff top and bottom are connected by open sites.

When $$N$$ is large, theory guarantees a sharp threshold $$p^*$$. What is the value of $$p^*$$?

Use Monte Carlo simulation:

• Initialize the whole $$N$$-by-$$N$$ grid to be blocked.
• Declare random sites open until the top is connected to the bottom.
• Vacancy percentage estimates $$p^*$$.

How to check whether an $$N$$-by-$$N$$ system percolates:

• Create an object for each site and name them $$0$$ to $$N^2 - 1$$.
• Sites are in same component if connected by open sites.
• Percolates iff any site on bottom row is connected to site on top row (brute-force algorithm: $$N^2$$ calls to connected()).
• ==Clever trick: introduce $$2$$ virtual sites connected to top and bottom separately and check if the virtual top site is connected to the virtual bottom site.== (only $$1$$ call to connected())

Virtual top/bottom backwash problem:

• When the system percolates (virtual top is connected to virtual bottom), every node that is connected to the bottom is also considered to be connected to the top (even though it is not). This is the backwash problem.
• A simple way to solve this problem is to use two WeightedQuickUnionUF objects, one with the virtual bottom and one without. Use the first to decide whether the system percolates and the second to see whether a node is full (connected to the top).
• A more memory-efficient way is to manually maintain two boolean arrays, contop[] and conbot[], to keep track of whether the root of a node is connected to the top/bottom. When a node is connected to the top and the bottom at the same time, the system percolates.

Solution Percolation.java

# Bags, Queues, and Stacks

• Operations: insert, remove, iterate.
• Stack: LIFO (last in first out).
• Queue: FIFO (first in first out).
• Bag: Adding items to a collection and iterating (order doesn't matter)

## Stack

### Resizing-Array vs. Linked-List

Linked-list implementation:

• Every operation takes constant time in the worst case.
• Use extra time and space to deal with the links.

Resizing-array implementation:

• Every operation takes constant amortized time.
• Less wasted space.

## Generics

• Avoid casting in client.
• Discover type mismatch errors at compile-time instead of run-time.
• Client code can use generic stack for any type of data.

Take Stack as an example
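A minimal sketch of such a generic stack backed by a resizing array (illustrative, not the course's exact code):

```java
public class ResizingArrayStack<Item> {
    private Item[] a;
    private int n = 0;

    @SuppressWarnings("unchecked")
    public ResizingArrayStack() {
        a = (Item[]) new Object[1];  // Java forbids new Item[]; cast an Object[]
    }

    public boolean isEmpty() { return n == 0; }

    public void push(Item item) {
        if (n == a.length) resize(2 * a.length);  // double when full
        a[n++] = item;
    }

    public Item pop() {
        Item item = a[--n];
        a[n] = null;  // avoid loitering so the GC can reclaim the object
        if (n > 0 && n == a.length / 4) resize(a.length / 2);  // halve at quarter full
        return item;
    }

    @SuppressWarnings("unchecked")
    private void resize(int capacity) {
        Item[] copy = (Item[]) new Object[capacity];
        System.arraycopy(a, 0, copy, 0, n);
        a = copy;
    }

    public static void main(String[] args) {
        ResizingArrayStack<String> stack = new ResizingArrayStack<>();
        stack.push("to"); stack.push("be"); stack.push("or");
        System.out.println(stack.pop()); // or
        System.out.println(stack.pop()); // be
    }
}
```

Doubling when full and halving at one-quarter full gives constant amortized time per operation while avoiding thrashing at the boundary.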

## Iterators

• Support iteration over stack items by client, without revealing the internal representation of the stack.
• Implement the java.lang.Iterable interface supported by Java.

### Java Support

Java supports elegant client code if Iterable interface is implemented.

## Applications

• Parsing in a compiler.
• Java virtual machine.
• Undo in a word processor.
• Back button in a Web browser.
• PostScript language for printers.
• Implementing function calls in a compiler

### Two-Stack Algorithm

To evaluate fully parenthesized infix expressions: $(\ 1\ +\ (\ (\ 2\ +\ 3\ )\ *\ (\ 4\ *\ 5\ )\ )\ )$

• Value: push onto the value stack.
• Operator: push onto the operator stack.
• Left parenthesis: ignore.
• Right parenthesis: pop operator and two values; push the result of applying that operator to those values onto the value stack.
• Dijkstra's two-stack algorithm computes the same value if the operator occurs after the two values.

$(\ 1\ (\ (\ 2\ 3\ +\ )\ (\ 4\ 5\ *\ )\ *\ )\ +\ )$

• All of the parentheses are redundant.

$1\ 2\ 3\ +\ 4\ 5\ *\ *\ +$
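Dijkstra's two-stack algorithm can be sketched as follows, restricted to + and * for brevity (the input format — whitespace-separated, fully parenthesized — is an assumption of this sketch):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TwoStackEvaluate {
    // Evaluates a fully parenthesized, whitespace-separated infix expression.
    static double evaluate(String expr) {
        Deque<String> ops = new ArrayDeque<>();
        Deque<Double> vals = new ArrayDeque<>();
        for (String s : expr.trim().split("\\s+")) {
            switch (s) {
                case "(": break;                        // left parenthesis: ignore
                case "+": case "*": ops.push(s); break; // operator: push
                case ")":                               // right parenthesis: apply
                    String op = ops.pop();
                    double v = vals.pop();
                    if (op.equals("+")) v = vals.pop() + v;
                    else                v = vals.pop() * v;
                    vals.push(v);
                    break;
                default: vals.push(Double.parseDouble(s)); // value: push
            }
        }
        return vals.pop();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("( 1 + ( ( 2 + 3 ) * ( 4 * 5 ) ) )")); // 101.0
    }
}
```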

## HW2 Deques and Randomized Queues

Specification

• Deque: A double-ended queue or deque is a generalization of a stack and a queue that supports adding and removing items from either the front or the back of the data structure. Deque.java
• Randomized queue: A randomized queue is similar to a stack or queue, except that the item removed is chosen uniformly at random from items in the data structure. RandomizedQueue.java
• Client: Write a client program that takes an integer $$k$$ as a command-line argument; reads in a sequence of strings from standard input using StdIn.readString(); and prints exactly $$k$$ of them, uniformly at random. Permutation.java

Use reservoir sampling to randomly choose $$k$$ samples from a list of $$n$$ elements in a single pass over the items:

• Keep the first $$k$$ items
• When the $$i$$-th item arrives
• with probability $$k/i$$, discard an old item at random and keep the new one
• with probability $$1 - k/i$$, discard the new item
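The reservoir-sampling steps above can be sketched as follows (0-indexed, so the item at index $$i$$ replaces an old one with probability $$k/(i+1)$$):

```java
import java.util.Random;

public class ReservoirSample {
    // Keep k uniform samples from a stream, seen one item at a time.
    static int[] sample(int[] stream, int k, Random rnd) {
        int[] reservoir = new int[k];
        for (int i = 0; i < stream.length; i++) {
            if (i < k) {
                reservoir[i] = stream[i];            // keep the first k items
            } else {
                int j = rnd.nextInt(i + 1);          // with probability k/(i+1) ...
                if (j < k) reservoir[j] = stream[i]; // ... replace a random old item
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        int[] stream = new int[100];
        for (int i = 0; i < stream.length; i++) stream[i] = i;
        int[] picked = sample(stream, 5, new Random());
        for (int v : picked) System.out.print(v + " ");
        System.out.println();
    }
}
```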

# Elementary Sorts

| | inplace? | stable? | worst | average | best | remarks |
| --- | --- | --- | --- | --- | --- | --- |
| selection | yes | no | $$N^2/2$$ | $$N^2/2$$ | $$N^2/2$$ | $$N$$ exchanges |
| insertion | yes | yes | $$N^2/2$$ | $$N^2/4$$ | $$N$$ | use for small $$N$$ or partially ordered |
| shell | yes | no | ? | ? | $$N$$ | tight code, subquadratic |
| merge | no | yes | $$N\lg N$$ | $$N\lg N$$ | $$N\lg N$$ | $$N\lg N$$ guarantee, stable |
| quick | yes | no | $$N^2/2$$ | $$2N\ln N$$ | $$N\lg N$$ | $$N\lg N$$ probabilistic guarantee, fastest in practice |
| 3-way quick | yes | no | $$N^2/2$$ | $$2N\ln N$$ | $$N$$ | improves quicksort in presence of duplicate keys |
| heap | yes | no | $$2N\lg N$$ | $$2N\lg N$$ | $$N\lg N$$ | $$N\lg N$$ guarantee, in-place |

## Comparable

Interface:

• Built-in comparable types: Integer, Double, String, Date, File, ...
• User-defined comparable types. Implement the Comparable interface supported by Java.

Two useful sorting helper functions:

• Less: is item $$v$$ less than $$w$$?
• Exchange: swap item in array at index $$i$$ with the one at index $$j$$.

## Selection Sort

• Scan from left to right over $$[i, N]$$.
• Find the minimum entry and exchange it with the entry at index $$i$$.
• Continue the next scan over $$[i+1, N]$$.

## Insertion Sort

• Scan $$i$$ from left to right over $$[0, N]$$.
• Scan $$j$$ from right to left over $$[i, 0]$$, exchanging `a[j]` with each larger entry to its left.
• Continue with the next scan starting at $$i+1$$.

## Shellsort

• Insertion sort with stride length $$h$$.
• ==A $$g$$-sorted array remains $$g$$-sorted after $$h$$-sorting it.==
• Use a sequence of increment steps to shellsort the array.

Which increment sequence to use?

• Power of two: $$1, 2, 4, 8, 16, 32, \dots$$ No
• Power of two minus one: $$1, 3, 7, 15, 31, 63, \dots$$ Maybe
• $$3x + 1$$: $$1, 4, 13, 40, 121, 364, \dots$$ OK. Easy to compute
• Sedgewick: $$1, 5, 19, 41, 109, 209, 505, 929, 2161, 3905, \dots$$ Good. Tough to beat in empirical studies
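Shellsort with the $$3x + 1$$ increment sequence might look like this sketch:

```java
public class ShellSort {
    // Shellsort with the 3x+1 increment sequence (1, 4, 13, 40, ...).
    static void sort(int[] a) {
        int n = a.length;
        int h = 1;
        while (h < n / 3) h = 3 * h + 1;  // largest increment below n/3
        while (h >= 1) {
            // h-sort the array: insertion sort with stride h
            for (int i = h; i < n; i++) {
                for (int j = i; j >= h && a[j] < a[j - h]; j -= h) {
                    int tmp = a[j]; a[j] = a[j - h]; a[j - h] = tmp;
                }
            }
            h /= 3;  // move to the next smaller increment
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 3, 8, 1, 9, 2};
        sort(a);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 5, 8, 9]
    }
}
```

The early large-stride passes leave the array partially ordered, so the final $$h = 1$$ pass is a nearly linear insertion sort.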

## Shuffle

Rearrange array so that result is a uniform random permutation.

Shuffle sort

• Generate a random real number for each array entry.
• Sort the array.

Knuth shuffle

• In iteration $$i$$, pick integer $$r$$ between $$0$$ and $$i$$ uniformly at random.
• Swap $$a[i]$$ and $$a[r]$$.
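A minimal sketch of the Knuth shuffle:

```java
import java.util.Arrays;
import java.util.Random;

public class KnuthShuffle {
    static void shuffle(int[] a, Random rnd) {
        for (int i = 0; i < a.length; i++) {
            int r = rnd.nextInt(i + 1);  // uniform in [0, i]
            int tmp = a[i]; a[i] = a[r]; a[r] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3, 4, 5};
        shuffle(a, new Random());
        System.out.println(Arrays.toString(a)); // a uniform random permutation
    }
}
```

Unlike shuffle sort, this takes linear time and a single pass, and each of the $$N!$$ permutations is equally likely.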

## Convex Hull

The convex hull of a set of $$N$$ points is the smallest perimeter fence enclosing the points.

Convex hull applications

• Robot motion planning: find the shortest path in the plane from $$s$$ to $$t$$ that avoids a polygonal obstacle. The shortest path is either a straight line from $$s$$ to $$t$$ or one of the two polygonal chains of the convex hull.
• Farthest pair problem: given $$N$$ points in the plane, find a pair of points with the largest Euclidean distance between them. The farthest pair of points are extreme points on the convex hull.

Graham scan:

• Choose point $$p$$ with smallest $$y$$-coordinate.
• Sort points by polar angle with $$p$$.
• Consider the points in order; discard each unless it creates a counterclockwise turn.

# Mergesort

• Divide array into two halves.
• Recursively sort each half.
• Merge two halves.
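The recursive scheme above can be sketched as follows, with the auxiliary array allocated once and reused:

```java
import java.util.Arrays;

public class MergeSort {
    static void sort(int[] a) {
        sort(a, new int[a.length], 0, a.length - 1);
    }

    private static void sort(int[] a, int[] aux, int lo, int hi) {
        if (lo >= hi) return;
        int mid = lo + (hi - lo) / 2;
        sort(a, aux, lo, mid);       // recursively sort each half
        sort(a, aux, mid + 1, hi);
        merge(a, aux, lo, mid, hi);  // merge the two sorted halves
    }

    private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        System.arraycopy(a, lo, aux, lo, hi - lo + 1);  // copy to auxiliary array
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if      (i > mid)         a[k] = aux[j++];  // left half exhausted
            else if (j > hi)          a[k] = aux[i++];  // right half exhausted
            else if (aux[j] < aux[i]) a[k] = aux[j++];  // take the smaller head
            else                      a[k] = aux[i++];
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 9, 1, 5, 6};
        sort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 5, 5, 6, 9]
    }
}
```

Taking from the left half on ties (`aux[j] < aux[i]`, not `<=`) is what makes the sort stable.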

The Java assert statement can be enabled or disabled at runtime.

## Practical Improvements

• Use insertion sort for small subarrays since mergesort has too much overhead for tiny subarrays.
• Stop if the biggest item in first half $$\le$$ smallest item in second half.
• Eliminate the copy to the auxiliary array by switching the role of the input and auxiliary array in each recursive call.