Spring 2026 · Midterm Review

Introduction to Embodied AI
One-stop review page

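Covers Lect01-08: from the overview of embodied AI, robotics fundamentals, and motion planning and control, to vision-based grasping, imitation learning, and reinforcement learning. Unlabeled content is organized from the lecture slides; parts tagged [not from slides] are there to help students who are new to the course build a mental framework.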

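8 lectures of slides
30+ self-test questions
12 review sections
1 double-sided A4 cheat sheet
How to use this page
1. Knowledge points without an extra label follow the definitions, conclusions, procedures, and terminology of the slides; the page structure has been reorganized for review, but it is not a page-by-page transcription.
2. Content tagged [not from slides] is meant to make the material easier for first-time learners and should not replace the original slides.
3. The logistics page of Lecture 8 says "Scope: from Lecture 1 - Lecture 9", but this has been explicitly confirmed to be a typo; following that correction, the review scope is Lect01-08.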

Midterm information (from Lect08 Logistics: Midterm)

  • Midterm (40% of the total score)
  • One-page double-sided A4 cheat sheet; handwritten or printed are both OK
  • Questions are all in English; if a term has already been covered in class, it will not be explained again in the exam
  • No dictionaries or calculators are allowed
  • Multiple-select questions: selecting any wrong option gives 0 points for that question; each missed correct option deducts 1 point; the minimum is 0 points
  • Short answer questions: explain why and how; some questions may require mathematical derivations

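Study advice [not from slides]

High priority

Lect02-04 Robotics fundamentals

Rigid transformations, FK/IK, SO(3), Euler angle / quaternion, motion planning, and PD/PID form the common language for the later vision and policy parts.

High priority

Lect07-08 Policy learning

state / observation / action, MDP / POMDP, BC / DAgger, reward-to-go, baseline, actor-critic, discount, and GAE; these concepts are tightly connected.

Medium priority

Lect05-06 Vision and grasping

open-loop grasping, 6D pose, ICP, continuous rotation representation, force closure, hand-eye calibration.

Medium priority

Lect01 Overview

Keep the big picture clear: embodied AI vs classical AI, generalist robots, VLA, synthetic data, and Sim vs Real.

Recommended study order [not from slides]
First read through Lect01 to build the global picture, then treat Lect02-04 as the "robotics language layer", next go through the vision-grasping pipeline of Lect05-06, and finally use Lect07-08 to tie "policy learning" together. After each chapter, immediately do its embedded questions and the interactive self-test at the end.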

Exam topic roadmap · organized overview

1

Lect01 Overview

What is a robot? Limitations of classical special-purpose robots, Embodied AI, generalist robots, VLA, synthetic data, sim-to-real.

2

Lect02 Robotics I

Kinematics vs Dynamics, rigid transformation, homogeneous coordinates, joint/link/DoF, FK/IK, Pieper's criterion, introduction to $SO(3)$ / $SE(3)$.

3

Lect03 Robotics II

Rotation representations: Euler angle, gimbal lock, angle-axis, Rodrigues, quaternion, slerp, distance on $SO(3)$.

4

Lect04 Robotics III

Motion planning, configuration space, collision checking, PRM / RRT / RRT-Connect, path vs trajectory, P / PD / PID, PD tuning.

5

Lect05 Vision and Grasping I

Open-loop grasping, 4-DoF / 6-DoF grasp, 6D object pose estimation, ICP, rotation regression, FoundationPose, NOCS.

6

Lect06 Vision and Grasping II

Grasp detection, voxel / point cloud, GS-Net, DexGraspNet 2.0, force closure, GraspNet-1Billion, camera model, PnP, AX=XB.

7

Lect07 Policy I

Policy / state / observation / dynamics model / world model, MDP, BC, distribution drift, teleoperation, HG-DAgger, teacher policy.

8

Lect08 Policy II

RL, reward, sparse vs dense, online / offline, REINFORCE, reward-to-go, baselines, actor-critic, discount, GAE.

Chapter 1 · Lect01 Overview

1.1 Course goals

  • A frontier course on Embodied AI
  • Covering basics in robotics and deep learning based vision and robotic system, from a modern perspective of Embodied AI
  • To lay a solid foundation for conducting research in Embodied AI

1.2 Robots and robotics

  • A robot is a machine capable of carrying out a complex series of actions automatically.
  • Robots can be guided by an external control device, or the control may be embedded within.
  • Robots may be constructed to take on human form.
  • Robotics is a branch of engineering that involves the conception, design, manufacture and operation of robots.
  • The objective of the robotics field is to create intelligent machines that can assist humans in a variety of ways.

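1.3 From classical special-purpose robots to Embodied AI

  • Classical special-purpose robots in a car factory: predesign and compute the trajectory, then only replay the trajectory when using the arm
  • Limitation 1: time-consuming deployment
  • Limitation 2: cannot flexibly handle multiple tasks
  • Slide conclusion: Very different from human intelligence!
Understanding note [not from slides]
This slide is really saying that a traditional industrial robot is closer to a "pre-programmed trajectory executor" than to an agent that can perceive, decide, and interact autonomously in an open environment. The rest of the course fills in the chain "perception - decision - control - interaction".

1.4 Core intuition of Embodied AI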

  • Perception-Action Loop
  • How humans learn: perceive, form hypotheses, and then take action to test them.
  • Our brain makes sense of the world around us by creating and testing hypotheses about how the world works.

1.5 Evolution of human intelligence and embodiment

Keywords the slides give for the evolution of human intelligence:

Keyword | Location in slides
Walk Upright | Evolution of Human Intelligence
Embodied Interaction | Evolution of Human Intelligence
Tool Usage | Evolution of Human Intelligence
Language | Evolution of Human Intelligence

1.6 The future picture: Generalist Robots

  • Past & Existing: Industrial robots
  • Now & Happening: Autonomous cars
  • Future & Our Dream: Generalist Robots - Humanoids
  • How the slides summarize generalist robots: Task generalists, Environment generalists

1.7 Two-level structure of the robot brain

Part | Slide wording
Cerebrum | Responsible for high-level policies; deciding what to do
Cerebellum | Responsible for low-level motor policies; determining how to execute actions in a fast, accurate, and reliable manner

Under "To Build Generalist Robots", the slides expand the robot brain as:

Module | Content
Cerebrum | Perception, planning, decision making, language, etc.; Vision-Language-Action model (VLA)
Cerebellum | Motion control, trajectory tracking, stability, error feedback; whole-body and whole-hand control; learned mainly via Reinforcement Learning (RL)

1.8 Vision-Language-Action and the data bottleneck

  • VLA: Inputs: language, vision, other proprioceptive / sensor signals; Outputs: robot actions
  • Advantages: end-to-end, benefit from VLM pretraining
  • Limitations: zero-shot performance lagging behind LLMs & VLMs
  • Real-world teleoperation is too costly to be scalable
  • Embodied foundation model pretraining may require trillions of trajectories
  • The largest VLA dataset is at the scale of 1M trajectories

1.9 The role of simulation

  • Simulation and Photorealistic Rendering: Annotation-free, Time efficient, Transferable to real world
  • For problems already done in real, learning in simulation requires generalization to real.
  • For problems still far to be done in real, simulation environment is a good place for exploration.

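Summary of expressions in this chapter

Perception-Action Loop: Perceive $\to$ form hypotheses $\to$ take action $\to$ perceive again

VLA: Inputs = language + vision + proprioceptive / sensor signals; Outputs = robot actions

Robot brain layers: Cerebrum $\to$ decide what to do; Cerebellum $\to$ decide how to execute actions

Concept check · Lect01
Which of the following best matches the slides' description of VLA?
A. The input is robot actions and the output is language
B. The inputs include language, vision, and other sensor signals, and the output is robot actions
C. It only processes text and does not handle motion control
D. It is only used in simulation and cannot be connected to real robots
Answer: B. Lecture 1 states Inputs: language, vision, other proprioceptive / sensor signals; Outputs: robot actions.
Lect01
Which of the following best matches Lecture 1's description of the future direction of robots?
A. A stronger trajectory replay system
B. Industrial arms that only do a single workstation task
C. Generalist Robots - Humanoids
D. Disembodied AI that only does language reasoning
Answer: C. In "Past, Now and Future of Robotics" the slides write the future as "Generalist Robots - Humanoids".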

Chapter 2 · Lect02 Robotics I

2.1 Kinematics vs. Dynamics

  • Kinematics: describing the motion of bodies (position, linear/angular velocity, linear/angular acceleration, etc.)
  • Kinematics does not consider how to achieve motion via force
  • Dynamics models all the way from force and torque to the motion

2.2 Link / Joint / DoF

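  • Link: the rigid bodies connected in sequence
  • Joint: the connectors between links
  • DoF: the number of independent parameters that define the configuration
  • Common joints: Revolute (R), Prismatic (P)
  • Other joints: Helical (H), Spherical (S)

2.3 Rigid-body and coordinate transformations

  • The frame that moves with the rigid body is the body frame $\mathcal{F}_b$; the observer's frame is $\mathcal{F}_s$
  • The pose question is: how to transform $\mathcal{F}_s$ so that it overlaps with $\mathcal{F}_b$?
  • First rotate by $R_{s\to b}$ to align the axes, then translate by $t_{s\to b}$ to align the origins
  • Coordinates of the same point in the two frames: $p^s = R_{s\to b} p^b + t_{s\to b}$
  • For any point $x$ carried to $x'$ by the same motion (both expressed in $\mathcal{F}_s$): ${x'}^s = R_{s\to b} x^s + t_{s\to b}$

2.4 Why introduce homogeneous coordinates

  • $(R_{s\to b}, t_{s\to b})$ is not a linear transformation, because the translation term breaks linearity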
  • Homogeneous coordinate for 3D space: $\tilde{x} \in \mathbb{R}^4$
  • Homogeneous transformation matrix: $T_{s\to b} = \begin{bmatrix} R_{s\to b} & t_{s\to b} \\ 0 & 1 \end{bmatrix}$
  • Coordinate transformation under linear form: $\tilde{x}^s = T_{s\to b} \tilde{x}^b$

2.5 Two rules for homogeneous transformations

Rule | Formula
Composition rule | $T_{3\to 1} = T_{3\to 2} T_{2\to 1}$
Change of observer's frame | $T_{2\to 1} = (T_{1\to 2})^{-1}$
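The two rules above can be checked numerically. The following is a minimal sketch in plain Python, assuming three hypothetical frames 1, 2, 3 with illustrative values that are not from the slides; the helper names (`make_T`, `invert_T`, `apply_T`) are made up for this example. It follows the convention $\tilde{x}^s = T_{s\to b}\tilde{x}^b$ used in this chapter.

```python
import math

def make_T(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and a translation t."""
    return [list(R[0]) + [t[0]], list(R[1]) + [t[1]], list(R[2]) + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def invert_T(T):
    """Closed-form inverse of a rigid transform: [R^T, -R^T t; 0, 1]."""
    R_T = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_T[i][k] * t[k] for k in range(3)) for i in range(3)]
    return make_T(R_T, t_inv)

def rot_z(angle):
    """Rotation matrix about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_T(T, p):
    """Transform a 3D point: append 1, multiply by T, drop the last entry."""
    ph = [p[0], p[1], p[2], 1.0]
    return [sum(T[i][k] * ph[k] for k in range(4)) for i in range(3)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical frames (values chosen only for illustration).
T_1to2 = make_T(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
T_2to3 = make_T(rot_z(math.pi / 4), [0.0, 2.0, 0.0])

T_1to3 = matmul(T_1to2, T_2to3)                       # composition rule
T_3to1 = invert_T(T_1to3)                             # change of observer's frame
T_3to1_alt = matmul(invert_T(T_2to3), invert_T(T_1to2))  # T_{3->2} T_{2->1}

print(max_abs_diff(T_3to1, T_3to1_alt))   # close to 0
print(apply_T(T_1to2, [1.0, 0.0, 0.0]))   # point given in frame 2, expressed in frame 1
```

Using the closed-form inverse avoids a general matrix inversion and keeps the result a valid rigid transform.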

2.6 Base link, end-effector, joint space, Cartesian space

  • Base link / root link: the $0$-th link of the robot; the spatial frame $\mathcal{F}_s$ is attached to it
  • End-effector link: the last link, e.g. the gripper; frame $\mathcal{F}_e$ is attached to it
  • Joint space: each coordinate is a vector of joint poses
  • Cartesian space: the space of rigid transformations of the end-effector, given by $(R_{s\to e}, t_{s\to e})$

2.7 Forward Kinematics and Inverse Kinematics

  • Forward Kinematics: map the joint space coordinate $\theta \in \mathbb{R}^n$ to a transformation matrix $T$, i.e. $T_{s\to e} = f(\theta)$
  • FK is obtained by composing transformations along the kinematic chain
  • Inverse Kinematics: given the map $T_{s\to e}(\theta)$ and a target pose $T_{target} \in SE(3)$, find $\theta$ such that $T_{s\to e}(\theta) = T_{target}$
  • Workspace: volume swept out by the end-effector as the robot does all possible motions

2.8 Difficulties of IK and classes of solvers

  • IK may have multiple solutions, or no solution at all
  • Even when a solution exists, it may require complex and expensive computations
  • Two classes: analytical methods and numerical methods
  • Analytical methods: derive a closed-form exact solution directly; only applicable to relatively simple chains
  • Numerical methods: use approximation and iteration; usually more expensive, but far more general purpose
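To make "multiple solutions or no solution" concrete, here is a small sketch for a planar 2R arm. This arm is an assumed toy example, not taken from the slides, and the function names `fk_2r` / `ik_2r` and the default link lengths are hypothetical. FK is a closed-form map from joint angles to the end-effector position, and the analytical IK returns two branches (elbow-down and elbow-up) or an empty list when the target lies outside the workspace.

```python
import math

def fk_2r(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics of a planar 2R arm: joint angles -> end-effector (x, y)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

def ik_2r(x, y, l1=1.0, l2=1.0):
    """Analytical IK: return a list of (theta1, theta2) solutions, possibly empty."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return []                      # target outside the workspace: no solution
    solutions = []
    for sign in (1.0, -1.0):           # the two elbow branches (they coincide if |c2| == 1)
        s2 = sign * math.sqrt(1.0 - c2 * c2)
        theta2 = math.atan2(s2, c2)
        theta1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
        solutions.append((theta1, theta2))
    return solutions

for th1, th2 in ik_2r(1.0, 1.0):
    print((th1, th2), fk_2r(th1, th2))   # both branches map back to the target (1, 1)
print(ik_2r(3.0, 0.0))                   # unreachable target: empty list
```

For 6-DoF arms the same idea applies, but a closed form only exists for special geometries, which is where numerical methods take over.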

2.9 Pieper's criterion

Pieper's criterion is used to decide whether a 6-DOF robotic arm has a closed-form inverse kinematics solution. The slides give two sufficient conditions (either one is enough):
1. The axes of three consecutive rotational joints intersect at a single point.
2. The axes of three consecutive rotational joints are parallel.
Note: this is only a sufficient condition but not necessary

2.10 $SO(3)$ and $SE(3)$

Space | Definition | Meaning
$SO(3)$ | $\{R \in \mathbb{R}^{3\times 3}: \det(R)=1, RR^T=I\}$ | 3D rotations, 3 DoF
$SE(3)$ | $T = \begin{bmatrix}R & t \\ 0 & 1\end{bmatrix}, R\in SO(3), t\in\mathbb{R}^3$ | rigid transformations, 6 DoF

Formula summary for this chapter

Point transform:$p^s = R_{s\to b} p^b + t_{s\to b}$

General transform:${x'}^s = R_{s\to b} x^s + t_{s\to b}$

Homogeneous transform:$T_{s\to b}=\begin{bmatrix}R_{s\to b} & t_{s\to b} \\ 0 & 1\end{bmatrix}$

Composition:$T_{3\to1}=T_{3\to2}T_{2\to1}$

Inverse:$T_{2\to1}=(T_{1\to2})^{-1}$

Forward kinematics:$T_{s\to e}=f(\theta)$

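Concept check · Lect02
Which of the following statements about forward kinematics and inverse kinematics is correct?
A. FK solves for joint angles from the end-effector pose
B. IK always has a unique solution
C. FK maps from joint space to the end-effector transformation; IK solves for the joint variables from a target pose
D. IK can only use analytical methods
Answer: C. In Lecture 2, FK is defined as $T_{s\to e}=f(\theta)$ and IK solves for $\theta$ given a target pose; IK may have multiple solutions or none.
Lect02
Which of the following is the correct form of a homogeneous transformation?
A. $T = \begin{bmatrix}R & t \\ 0 & 1\end{bmatrix}$
B. $T = R + t$
C. $T = \begin{bmatrix}1 & R \\ t & 0\end{bmatrix}$
D. $T = R^TR$
Answer: A. Lecture 2 gives the standard form of the homogeneous transformation matrix directly.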

Chapter 3 · Lect03 Robotics II

3.1 Basic properties of rotations

  • A rotation preserves lengths
  • Cross products are preserved by a rotation
  • Therefore rotation matrices satisfy: $RR^T = R^TR = I$, $\det(R)=1$

3.2 Why rotation matrices are not ideal

  • Efficiency: a rotation has $3$ DoF, but a rotation matrix needs $9$ values
  • Numerical stability: orthogonality must be maintained

3.3 Euler angle

  • Euler angles are three angles introduced by Leonhard Euler to describe the orientation of a rigid body with respect to a fixed coordinate system.
  • Composition given in Lecture 3: $R = R_z(\gamma) R_y(\beta) R_x(\alpha)$
  • Advantage: good interpretability

3.4 Gimbal lock and the limitations of Euler angles

When $\beta = \pi/2$, the slides show that changing $\alpha$ and changing $\gamma$ have the same effect, so a degree of freedom disappears.
Conclusion: Euler angles can parameterize every rotation, but the representation is not unique at some points, and at such points some changes in the target space (rotations) cannot be achieved by changes in the source space (angles).
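Gimbal lock can be seen numerically. The sketch below assumes the composition $R = R_z(\gamma) R_y(\beta) R_x(\alpha)$ from Section 3.3; the specific angle values are arbitrary illustrations. At $\beta = \pi/2$ the matrix depends only on $\gamma - \alpha$, so shifting $\alpha$ and $\gamma$ by the same amount leaves $R$ unchanged, while away from the singularity the same shift gives a different rotation.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(b):
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(g):
    c, s = math.cos(g), math.sin(g)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def euler_zyx(alpha, beta, gamma):
    """R = Rz(gamma) Ry(beta) Rx(alpha)."""
    return matmul3(rot_z(gamma), matmul3(rot_y(beta), rot_x(alpha)))

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

shift = 0.2
# At beta = pi/2: shifting alpha and gamma together does not change R.
locked = max_abs_diff(euler_zyx(0.3, math.pi / 2, 0.5),
                      euler_zyx(0.3 + shift, math.pi / 2, 0.5 + shift))
# Away from the singularity the same shift produces a different rotation.
free = max_abs_diff(euler_zyx(0.3, 0.3, 0.5),
                    euler_zyx(0.3 + shift, 0.3, 0.5 + shift))
print(locked, free)
```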

3.5 Angle-axis and Euler's theorem

  • Any rotation is equivalent to a rotation about a fixed axis $\hat{\omega}\in\mathbb{R}^3$ with $\|\hat{\omega}\|=1$ through a positive angle $\theta$
  • $\hat{\omega}$ is the unit vector of the rotation axis
  • $\theta$ is the angle of rotation
  • $R = \mathrm{Rot}(\hat{\omega}, \theta) \in SO(3)$

3.6 Rodrigues formula and exponential coordinates

  • Skew-symmetric matrix: if $A = -A^T$, then $A$ is skew-symmetric
  • The cross product can be written in linear form: $a \times b = [a]b$, where $[a]$ is the skew-symmetric matrix built from $a$
  • Rodrigues formula (for a unit axis $\hat{\omega}$): $e^{[\hat{\omega}]\theta} = I + [\hat{\omega}]\sin\theta + [\hat{\omega}]^2(1-\cos\theta)$
  • $\vec{\theta} = \hat{\omega}\theta$ is also called the rotation vector, or exponential coordinate
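A minimal sketch of the Rodrigues formula in plain Python follows. The function names are hypothetical, and the axis is normalized inside the function because the formula assumes a unit axis.

```python
import math

def skew(w):
    """Skew-symmetric matrix [w] such that [w] b = w x b."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]

def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def rodrigues(axis, theta):
    """R = I + [w] sin(theta) + [w]^2 (1 - cos(theta)) for the normalized axis w."""
    norm = math.sqrt(sum(a * a for a in axis))
    w = [a / norm for a in axis]
    K = skew(w)
    K2 = matmul3(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * K2[i][j] for j in range(3)]
            for i in range(3)]

def apply(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

R = rodrigues([0.0, 0.0, 1.0], math.pi / 2)
print(apply(R, [1.0, 0.0, 0.0]))   # approximately [0, 1, 0]
```

Calling `rodrigues([0, 0, -1], -math.pi / 2)` produces the same matrix, which matches the fact that $(\hat{\omega}, \theta)$ and $(-\hat{\omega}, -\theta)$ describe one rotation.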

3.7 Non-uniqueness of the angle-axis parameterization

  • $(\hat{\omega}, \theta)$ and $(-\hat{\omega}, -\theta)$ give the same rotation
  • When $R=I$, $\theta=0$ and the axis is arbitrary
  • $(\hat{\omega}, \pi)$ and $(-\hat{\omega}, \pi)$ give the same rotation
  • If we restrict $\theta\in(0,\pi)$, a unique parameterization exists

3.8 Quaternion

  • A quaternion is a generalized complex number: $q = w + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$
  • Vector form: $q = (w, \vec{v})$
  • A unit quaternion can represent a rotation
  • Four numbers plus one constraint $\Rightarrow 3$ DoF
  • Rotating a vector: first augment $\vec{x}$ to $x=(0,\vec{x})$, then compute $x' = qxq^*$
  • Composing rotations with quaternions: just multiply the quaternions
  • Each rotation corresponds to two quaternions (double covering)
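The quaternion operations above can be sketched in a few lines of plain Python. Quaternions are stored as `(w, x, y, z)` tuples; the helper names are hypothetical and `quat_from_axis_angle` assumes a unit axis.

```python
import math

def quat_mul(p, q):
    """Hamilton product of two quaternions (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def quat_conj(q):
    return (q[0], -q[1], -q[2], -q[3])

def quat_from_axis_angle(axis, theta):
    """Unit quaternion for a rotation of theta about a unit axis."""
    s = math.sin(theta / 2.0)
    return (math.cos(theta / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def rotate(q, v):
    """x' = q x q*, with x = (0, v)."""
    x = (0.0, v[0], v[1], v[2])
    return quat_mul(quat_mul(q, x), quat_conj(q))[1:]

q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
neg_q = tuple(-c for c in q)

print(rotate(q, (1.0, 0.0, 0.0)))                # approximately (0, 1, 0)
print(rotate(neg_q, (1.0, 0.0, 0.0)))            # same result: q and -q are the same rotation
print(rotate(quat_mul(q, q), (1.0, 0.0, 0.0)))   # two quarter turns: approximately (-1, 0, 0)
```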

3.9 Quaternion vs rotation matrix

Aspect | Quaternion | Rotation Matrix
Storage | 4 floating-point values | 9 values
Multiplication | 16 multiplications + 12 additions | 27 multiplications + 18 additions
Numerical stability | normalizing maintains unit magnitude | successive operations may violate orthogonality

3.10 SLERP and rotation distance

  • Why not linear interpolation? Because the result has to be renormalized, and it does not have a constant rate of rotation
  • Spherical linear interpolation (SLERP): shortest path between two points on the sphere; traverses a great arc on the sphere of unit quaternions
  • It has uniform angular rotation velocity about a fixed axis
  • Rotation distance: $\mathrm{dist}(R_1, R_2) = \arccos\left(\frac{\mathrm{tr}(R_2 R_1^T)-1}{2}\right)$
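The sketch below implements SLERP between unit quaternions `(w, x, y, z)` and the rotation distance formula above. The sign flip for the shorter arc and the small-angle fallback are common implementation choices, not details stated in the slides.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions, for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # q and -q are the same rotation: take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    omega = math.acos(min(1.0, dot))   # angle between the two quaternions on the sphere
    if omega < 1e-8:                   # nearly identical: nothing to interpolate
        return tuple(q0)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))

def rotation_distance(R1, R2):
    """dist(R1, R2) = arccos((tr(R2 R1^T) - 1) / 2), clamped for numerical safety."""
    trace = sum(R2[i][j] * R1[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))

q_identity = (1.0, 0.0, 0.0, 0.0)
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
q_mid = slerp(q_identity, q_z90, 0.5)   # a rotation of pi/4 about z
print(q_mid)

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_distance(I3, Rz90))      # pi / 2
```

The interpolated quaternion stays on the unit sphere without renormalization, which is the point of SLERP compared with plain linear interpolation.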

3.11 Usage advice at the end of the slides

Representation | Use suggested by the slides
rotation matrices | define concepts
Euler angles | visualize rotations
angle-axis | visualize rotations and calculate derivatives
quaternion | write fast codes

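Formula summary for this chapter

$SO(3)$: $\{R \in \mathbb{R}^{3\times 3}: RR^T=I, \det(R)=1\}$

Euler composition: $R = R_z(\gamma)R_y(\beta)R_x(\alpha)$

Rodrigues: $e^{[\hat{\omega}]\theta}=I+[\hat{\omega}]\sin\theta+[\hat{\omega}]^2(1-\cos\theta)$

Quaternion rotation: $x' = qxq^*$, where $x=(0,\vec{x})$

Rotation distance: $\mathrm{dist}(R_1,R_2)=\arccos\left(\frac{\mathrm{tr}(R_2R_1^T)-1}{2}\right)$

Concept check · Lect03
Which of the following statements about rotation representations is wrong?
A. Euler angles are intuitive but suffer from gimbal lock
B. Quaternions are better suited to efficient code
C. The same rotation corresponds to two quaternions
D. A rotation matrix needs only 3 numbers, so it is more compact than a quaternion
Answer: D. A rotation matrix needs $9$ numbers, while a quaternion needs $4$ numbers plus a unit-norm constraint, so the quaternion is more compact.
Lect03
Why does the course cover rotation matrices as well as Euler angles, angle-axis, and quaternions?
Because the representations have different strengths and weaknesses. The summary at the end of the slides is: rotation matrices to define concepts, Euler angles to visualize rotations, angle-axis to visualize rotations and calculate derivatives, quaternions to write fast codes. In other words, the course does not want you to "know only one representation" but to know when to use which. [not from slides]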

Chapter 4 · Lect04 Robotics III

4.1 The big picture of the robotics stack

Stage (slides) | Keywords | Core question
Goal | task / target | what to do
Motion Planning | collision-free search | what motion is collision-free and geometrically feasible?
Trajectory | path / waypoints | how to add timing to the path
Control | track & stabilize | how do we track that motion reliably in the real system?
Robot | execution | actual execution on hardware

4.2 Problem definition of motion planning

  • Question: from one robot pose to another robot pose, how does the robot move safely in the world?
  • Input: a start state $q_{start}$, a goal state $q_{goal}$, and the collision-free space $C_{free}$
  • Output: a feasible path connecting the start to the goal (not actions!)
  • Core: Motion planning is a search problem in a high-dimensional constrained space

4.3 Workspace vs Configuration space

Space | Slide description
Workspace | Real 3D world; obstacle, robot, shelf, objects
Configuration space | One point corresponds to one joint configuration
  • Planning is usually performed in configuration space
  • The robot is not a point. Its shape, joints, and limits must be considered.
  • Collision constraints depend on the full robot configuration
  • A straight line connecting the start to the goal may run into a collision

4.4 Collision checking

  • Is a configuration $q$ in $C_{free}$? Run a collision check!
  • Collision checking is called repeatedly inside planning algorithms
  • It needs to be very fast while maintaining accuracy
  • However, accurate collision checking is very slow
  • Visual mesh for rendering $\neq$ collision mesh for collision checking
  • The collision mesh of each link is defined in the URDF and is usually a geometric approximation
  • Convex-convex collision checking is usually very fast on CPU

4.5 Grid-based vs sample-based planning

  • Grid-based: discretize the whole configuration space, compute $C_{free}$, and run a search algorithm such as A*
  • Drawback: too slow for high-dimensional spaces
  • Sample-based algorithms: PRM, RRT, RRT-Connect, etc.

4.6 PRM / RRT / RRT-Connect / Shortcutting

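Method | Slide information
PRM | Probabilistic Roadmap; asymptotically optimal but requires massive sampling
RRT | Rapidly-exploring Random Tree
RRT-Connect | the classic algorithm for single-query path planning
Shortcutting | very useful post-processing heuristic; fast local optimization; improves jerky, unnatural paths
Understanding note [not from slides]
What to remember here is not the pseudocode of each algorithm but three things: first, planning is essentially path finding in a high-dimensional constrained space; second, sample-based methods are better suited to high dimensions; third, the path output by a planner is usually not "directly executable" and still needs a trajectory and control.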
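As a rough illustration of sample-based planning, the sketch below is a minimal RRT in a 2D configuration space with circular obstacles. The world bounds, step size, goal bias, and obstacle layout are all assumptions made for this example and are not from the slides; a real planner would run collision checks against the robot's collision meshes in joint space.

```python
import math
import random

def collision_free(q, obstacles):
    """A configuration is free if it lies outside every circular obstacle (center, radius)."""
    return all(math.dist(q, center) > radius for center, radius in obstacles)

def edge_free(a, b, obstacles, resolution=0.05):
    """Check an edge by testing interpolated configurations along it."""
    n = max(1, int(math.dist(a, b) / resolution))
    return all(collision_free((a[0] + (b[0] - a[0]) * i / n,
                               a[1] + (b[1] - a[1]) * i / n), obstacles)
               for i in range(n + 1))

def rrt(q_start, q_goal, obstacles, step=0.5, goal_bias=0.1,
        max_iters=5000, bounds=(0.0, 10.0), seed=0):
    """Grow a tree from q_start; return a list of configurations or None on failure."""
    rng = random.Random(seed)
    nodes, parent = [q_start], {0: None}
    for _ in range(max_iters):
        # Sample a random configuration, occasionally the goal itself.
        if rng.random() < goal_bias:
            q_rand = q_goal
        else:
            q_rand = (rng.uniform(*bounds), rng.uniform(*bounds))
        # Extend the nearest tree node by at most one step toward the sample.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], q_rand))
        q_near = nodes[i_near]
        d = math.dist(q_near, q_rand)
        if d < 1e-9:
            continue
        scale = min(1.0, step / d)
        q_new = (q_near[0] + (q_rand[0] - q_near[0]) * scale,
                 q_near[1] + (q_rand[1] - q_near[1]) * scale)
        if not edge_free(q_near, q_new, obstacles):
            continue
        nodes.append(q_new)
        parent[len(nodes) - 1] = i_near
        # Connect to the goal when it is within one step and the edge is free.
        if math.dist(q_new, q_goal) <= step and edge_free(q_new, q_goal, obstacles):
            nodes.append(q_goal)
            parent[len(nodes) - 1] = len(nodes) - 2
            path, i = [], len(nodes) - 1
            while i is not None:       # walk back to the root
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None

obstacles = [((5.0, 5.0), 2.0)]        # one disk blocking the straight line
path = rrt((1.0, 1.0), (9.0, 9.0), obstacles)
print(None if path is None else len(path))
```

The returned path is only a geometric sequence of configurations; it is typically jerky, which is why shortcutting and time parameterization follow.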

4.7 OMPL

  • OMPL = Open Motion Planning Library
  • Default planning library in ROS-MoveIt
  • Geometric planners: only consider geometric and kinematic constraints; assume that any feasible path can be turned into a dynamically feasible trajectory
  • Control-based planners: if the system is subject to differential constraints, use state propagation rather than simple interpolation

4.8 From path to trajectory

  • Path (Geometry): tells us where to go
  • Trajectory (Time): tells us when to be there and how fast to move
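  • A path is written as $q(s)$, with $s\in[0,1]$
  • Time parameterization: $s=s(t) \Rightarrow q(t)=q(s(t))$
  • Real robots have joint velocity and acceleration limits, so a geometric path alone is not enough
  • Slide example: the Minimum Jerk Trajectory uses a $5$th-order polynomial, guarantees smooth starts and stops, and minimizes jerk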
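A minimal sketch of time parameterization with a quintic time scaling follows. It assumes rest-to-rest motion (zero velocity and acceleration at both ends), for which the minimum-jerk scaling is $s(\tau) = 10\tau^3 - 15\tau^4 + 6\tau^5$ with $\tau = t/T$; the straight-line joint path $q(s)$ and the numeric values are hypothetical.

```python
def min_jerk_s(t, duration):
    """Quintic time scaling s(t) in [0, 1] for rest-to-rest motion."""
    tau = min(max(t / duration, 0.0), 1.0)
    return 10.0 * tau**3 - 15.0 * tau**4 + 6.0 * tau**5

def min_jerk_s_dot(t, duration):
    """Time derivative ds/dt of the scaling above."""
    tau = min(max(t / duration, 0.0), 1.0)
    return (30.0 * tau**2 - 60.0 * tau**3 + 30.0 * tau**4) / duration

def q_of_t(q_start, q_goal, t, duration):
    """Trajectory q(t) = q(s(t)) along a straight-line path q(s) in joint space."""
    s = min_jerk_s(t, duration)
    return [a + s * (b - a) for a, b in zip(q_start, q_goal)]

T = 2.0                                   # hypothetical motion duration in seconds
q0, q1 = [0.0, 0.5], [1.0, -0.5]          # hypothetical 2-joint start and goal
for t in (0.0, 0.5 * T, T):
    print(t, q_of_t(q0, q1, t, T), min_jerk_s_dot(t, T))
```

The velocity is zero at both ends and peaks in the middle, which is what "smooth starts and stops" means here.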

4.9 Control system

  • A control system regulates a system's behavior to follow a reference trajectory $x_{ref}(t), \dot{x}_{ref}(t)$, despite disturbances and uncertainties
  • Main components: Sensor, Controller, Environment/System, Actuator
  • Open loop (feedforward): control signal based only on the reference
  • Closed loop (feedback): the controller uses the error to adjust the control signal
  • Closed-loop advantages: robust to model uncertainties, rejects disturbances and noise, accurately tracks reference signals

4.10 Error metrics

  • Tracking error: $x_e = x_{ref} - x$
  • Steady-state error: $e_{ss}$
  • Transient response metrics: rise time, settling time, overshoot
  • Goals: small tracking error, small $e_{ss}$, small settling time, minimal oscillation

4.11 P / PD / PID

Controller | Core slide wording
P | Serves as a virtual spring; $u = K_p x_e$
PD | The D term is a velocity feedback term acting as virtual damping; accounts for future behavior / trend; attempts to reduce overshoot
PID | The I term is an error accumulation term that eliminates steady-state offset; reacts to past behavior / history
  • P only: with small $K_p$, slow response / large tracking error; with large $K_p$, fast but oscillatory
  • PD: $K_p$ sets the strength of the action; $K_d$ sets the damping
  • PID: the I term can eliminate steady-state error
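The qualitative behavior of P, PD, and PID can be reproduced with a small simulation. The sketch below assumes a unit point mass with a constant disturbance force (a stand-in for unmodeled gravity) and a constant reference; the gains, time step, and the `simulate` helper are illustrative choices, not values from the slides.

```python
def simulate(kp, ki, kd, disturbance=-1.0, x_ref=1.0, dt=0.001, t_end=20.0):
    """Simulate a unit mass under PID control; return (final error, peak position)."""
    x, v, integral = 0.0, 0.0, 0.0
    peak = x
    for _ in range(int(t_end / dt)):
        e = x_ref - x
        integral += e * dt
        # For a constant reference, the error derivative is -v.
        u = kp * e + ki * integral + kd * (0.0 - v)
        a = u + disturbance            # unit mass: acceleration = total force
        v += a * dt                    # semi-implicit Euler integration
        x += v * dt
        peak = max(peak, x)
    return x_ref - x, peak

err_p, peak_p = simulate(kp=100.0, ki=0.0, kd=0.0)       # P only: keeps oscillating
err_pd, peak_pd = simulate(kp=100.0, ki=0.0, kd=20.0)    # PD: damped, but offset remains
err_pid, peak_pid = simulate(kp=100.0, ki=50.0, kd=20.0) # PID: offset removed

print("P   peak:", peak_p)
print("PD  final error:", err_pd)     # about disturbance magnitude / Kp = 0.01
print("PID final error:", err_pid)    # close to 0
```

With PD alone the spring force must balance the disturbance, so an error of roughly $1/K_p$ remains; the integral term accumulates that error until the offset disappears.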

4.12 Why modern robots often prefer PD over full PID

  • PD is often sufficient for tracking mechanical systems, simpler to tune, and usually more robust in real hardware
  • In many robots, the main challenge is fast and stable motion, not eliminating tiny steady-state errors
  • In contact-rich tasks, integral action may lead to unwanted force buildup
  • To reduce or eliminate steady-state error, modern robots often combine PD control with model-based compensation, gravity compensation, and feedforward torque

4.13 PD tuning

  • A second-order system is described by its natural frequency $\omega_n$ and damping ratio $\xi$
  • $\xi > 1$ overdamped; $\xi < 1$ underdamped; $\xi = 1$ critically damped
  • Conclusion from matching terms in the slides: a larger $\omega_n$ corresponds to a larger P gain; more damping corresponds to a larger D gain
  • This is typically used to heuristically design a good initial set of PD parameters, which still needs finetuning afterwards
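The term matching can be written out explicitly. The derivation below is a sketch under an assumed model that the slides may state differently: a single joint treated as a mass $m$ with no other forces, driven by PD control toward a constant reference $x_{ref}$ (so $\dot{x}_e = -\dot{x}$).

```latex
\begin{aligned}
m\ddot{x} &= u = K_p\,(x_{ref} - x) - K_d\,\dot{x} \\
\ddot{x}_e + \frac{K_d}{m}\,\dot{x}_e + \frac{K_p}{m}\,x_e &= 0,
\qquad x_e = x_{ref} - x \\
\text{compare with}\quad \ddot{x} + 2\xi\omega_n\dot{x} + \omega_n^2 x &= 0: \\
\omega_n^2 = \frac{K_p}{m},\quad 2\xi\omega_n = \frac{K_d}{m}
\;&\Longrightarrow\;
K_p = m\,\omega_n^2,\quad K_d = 2\,m\,\xi\,\omega_n
\end{aligned}
```

For example, under this model a unit mass with $\omega_n = 10$ and $\xi = 1$ gives $K_p = 100$ and $K_d = 20$ as a starting point for tuning.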

4.14 Rules of thumb for multi-DoF systems

  • Rule of thumb: apply the single-joint tuning method to each joint
  • In humanoids, ankle joints require more compliance
  • A higher D gain can reduce oscillation, but it also amplifies noise from velocity sensors
  • All PD tuning must respect hardware torque limits
  • Because robotic arms are modeled more accurately and operate in more controlled environments, they can also use model-based methods such as feedback linearization

Formula summary for this chapter

Path:$q(s), s\in[0,1]$

Time parameterization:$s=s(t) \Rightarrow q(t)=q(s(t))$

Tracking error:$x_e=x_{ref}-x$

P control:$u=K_p x_e$

PD control:$u=K_p x_e + K_d \dot{x}_e$

PID control:$u=K_p x_e + K_i \int x_e dt + K_d \dot{x}_e$

Standard second-order form:$\ddot{x}+2\xi\omega_n\dot{x}+\omega_n^2 x=0$

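Concept check · Lect04
Which of the following statements about path, trajectory, and control is correct?
A. The path answers where to go, the trajectory answers when to be there and how fast to move, and control does the tracking
B. A trajectory contains only geometry, not time
C. Open-loop control rejects disturbances better than closed-loop control
D. The D term in PD is used to eliminate all steady-state error
Answer: A. In Lecture 4 the path is geometry, the trajectory is the path with timing, and control does the tracking; closed-loop control is more robust to disturbances, and it is the I term that eliminates steady-state offset.
Lect04
Which statement about the output of motion planning is most accurate?
A. It directly outputs a sequence of joint torques
B. a feasible path connecting the start to the goal
C. value function
D. reward model
Answer: B. The slides explicitly stress that the output is a path, not actions.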

Chapter 5 · Lect05 Vision and Grasping I

5.1 From robotics to the grasping pipeline

  • Kinematics: conversion between Cartesian space and joint space
  • Motion Planning: get a geometrically valid path
  • Control: ensures the motion follows the planned trajectory
  • If we can provide where to move, a.k.a. end-effector pose in Cartesian space, this already forms a pipeline to grasp objects.

5.2 Grasping and grasp pose

  • Grasping: restraining an object's motion in a desired way by applying forces and torques at a set of contacts
  • Grasp Synthesis: high-dimensional search or optimization problem to find gripper poses or joint configurations
  • Grasp Pose defines the position, orientation and articulation of a hand
  • 4-DoF grasp: a 3D position and 1D hand orientation aligned with the direction of gravity, a.k.a. top-down grasping
  • 6-DoF grasp: a 3D position and 3D orientation

5.3 Two routes for open-loop grasping

Route | Slide description
Path I | For known objects with labeled grasps: 6D object pose estimation $\to$ further get grasping pose from object pose $\to$ motion planning
Path II | For unknown and general objects: directly predict grasping pose $\to$ motion planning

5.4 6D object pose

  • Definition: 6D transformation from object to camera space
  • It consists of a 3DoF translation and a 3DoF rotation
  • Instance-level 6D pose estimation: small set of known instances; pose defined according to their CAD model; the input can be RGB / RGBD; the output is the object pose(s)
5.5 PoseCNN

  • The slides use PoseCNN as the representative method for instance-level 6D pose estimation
  • Its rotation estimation slide explicitly says: Regress quaternion

5.6 ICP

  • ICP is a method for point cloud registration
  • Given two point clouds, find $R$ and $T$ that align one point cloud to the other as closely as possible
  • Steps: center the data, find correspondences, solve the constrained orthogonal Procrustes problem, obtain the translation, iterate
  • Termination condition: the transformation change is below a threshold / the loss change is below a threshold / the maximum number of iterations is reached
ICP pros | ICP cons
Simple; no need for segmentation or feature extraction; decent accuracy and convergence with a good initialization | finding correspondences is costly; only considers point-to-point distances; highly dependent on the accuracy of the initial estimate
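The ICP loop can be sketched in 2D with only the standard library. This is a simplification chosen for brevity and not the slides' formulation: in 2D the orthogonal Procrustes step reduces to a single angle computed with `atan2`, whereas the 3D version typically solves for $R$ with an SVD. The point sets and helper names are made up for the example.

```python
import math

def best_rigid_2d(src, dst):
    """Closed-form least-squares rotation angle and translation for paired 2D points."""
    n = len(src)
    cs = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)   # source centroid
    cd = (sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n)   # target centroid
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - cs[0], py - cs[1]            # centered source point
        bx, by = qx - cd[0], qy - cd[1]            # centered target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)             # 2D Procrustes solution
    c, s = math.cos(theta), math.sin(theta)
    t = (cd[0] - (c * cs[0] - s * cs[1]), cd[1] - (s * cs[0] + c * cs[1]))
    return theta, t

def transform(points, theta, t):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]

def icp_2d(source, target, max_iters=50, tol=1e-10):
    """Align source to target; return (aligned points, mean squared residual)."""
    current, prev_err, err = list(source), float("inf"), float("inf")
    for _ in range(max_iters):
        # Correspondences: nearest target point for every current source point.
        matches = [min(target, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
                   for p in current]
        theta, t = best_rigid_2d(current, matches)
        current = transform(current, theta, t)
        err = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                  for p, q in zip(current, matches)) / len(current)
        if abs(prev_err - err) < tol:              # loss change below threshold
            break
        prev_err = err
    return current, err

target = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0), (-1.0, 1.0)]
source = transform(target, 0.1, (0.2, -0.1))       # small, known misalignment
aligned, err = icp_2d(source, target)
print(err)
```

The example starts from a small misalignment on purpose: with a poor initial estimate the nearest-neighbor correspondences can be wrong and the loop may settle in a local minimum, which is the initialization sensitivity listed among the cons.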

5.7 The representation problem in rotation regression

  • A 3D rotation only has $3$ DoF, but a rotation matrix has $9$ elements, which makes it harder for a neural network to predict
  • Candidate representations: Euler angle, axis-angle, quaternion
  • These representations often suffer from singularities and discontinuities
  • Euler angles are discontinuous; axis-angle has problems near $\theta=0$ and $\theta=\pi$; quaternions have double coverage

5.8 Continuous rotation representations

  • 6D representation: simply eliminates the last column of the rotation matrix
  • Gram-Schmidt orthogonalization then maps the network output to $SO(3)$
  • 9D representation: corresponds to the full rotation matrix; SVD orthogonalization then maps it to $SO(3)$
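The 6D-to-$SO(3)$ mapping can be sketched as follows: treat the six numbers as two 3D vectors (candidates for the first two columns), orthonormalize them with Gram-Schmidt, and recover the third column with a cross product. The function name `rotation_from_6d` and the sample inputs are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotation_from_6d(a1, a2):
    """Map two raw 3D vectors (a network's 6D output) to a rotation matrix."""
    b1 = normalize(a1)                                   # first column
    d = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - d * x for x, y in zip(b1, a2)])  # remove the b1 component
    b3 = cross(b1, b2)                                   # third column, right-handed
    return [[b1[i], b2[i], b3[i]] for i in range(3)]     # b1, b2, b3 are the columns

R_id = rotation_from_6d([2.0, 0.0, 0.0], [1.0, 3.0, 0.0])   # reduces to the identity
R = rotation_from_6d([0.3, -1.2, 0.5], [0.9, 0.1, -0.4])    # arbitrary raw output
print(R_id)
print(R)
```

The construction fails only when the two input vectors are parallel or zero, which a real implementation would need to guard against.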

5.9 FoundationPose

  • FoundationPose is a unified foundation model for 6D object pose estimation and tracking
  • supports both model-based and model-free setups
  • Initialization: first detect the object, then initialize the translation from the 3D point at the bbox center and the median depth, then initialize rotations by sampling viewpoints from an icosphere
  • Pose conditioned input crop: the rendering of the coarse pose and the input crop are used to output a refined pose

5.10 Category-level 6D pose estimation and NOCS

  • Goal: estimate 6D pose and 3D size of previously unseen objects from certain categories
  • The key is NOCS = Normalized Object Coordinate Space
  • Three normalization steps: rotation normalization, translation normalization, scale normalization
  • The category-level pose is the transformation from NOCS to camera space
  • From image to NOCS map to pose: predict the NOCS map, back-project using depth, then do pose fitting with RANSAC

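Formula summary for this chapter

6D object pose: 3DoF translation + 3DoF rotation

ICP objective: find $(R,T)$ that aligns the source point cloud to the target point cloud as closely as possible

Quaternion regression: predict the rotation as a quaternion

6D rotation representation: take the first two columns of the rotation matrix, then map to $SO(3)$ with Gram-Schmidt orthogonalization

Category-level pose: transformation from NOCS to camera space

Concept check · Lect05
Which of the following statements about 6D pose estimation, ICP, and NOCS is correct?
A. ICP does not depend on the initial estimate, so it always converges reliably
B. NOCS only applies to instance-level pose estimation with known CAD models
C. In category-level pose estimation, NOCS serves as the canonical reference frame
D. A 6D object pose only contains rotation, not translation
Answer: C. In Lecture 5 the category-level pose is the transformation from NOCS to camera space; ICP is sensitive to initialization; a 6D object pose contains both translation and rotation.
Understanding note [not from slides]
Lecture 5 can be summarized in one sentence: first decide "where to grasp", then decide "with what hand pose to grasp", where 6D object pose estimation is the key intermediate variable on the known-objects route.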

Chapter 6 · Lect06 Vision and Grasping II

6.1 MuJoCo

  • MuJoCo = Multi-Joint dynamics with Contact
  • general-purpose physics engine designed for fast and accurate simulation of articulated structures interacting with their environments

6.2 Two forms of grasp models

Form | Slide description
Grasp detection | detect multiple grasp poses from observations
Conditional grasp generation | condition: observation; output: grasp poses

The slides specifically note: Due to multi-modal grasp distribution, formulate grasping as a detection problem.

6.3 Visual input representation

Representation | Slide keywords | Characteristics
Voxel Grids | VGN | Explicit geometry; limited by volume resolution
Point Cloud | PointNet / PointNet++ / GraspNet-baseline | Explicit geometry; less memory cost; higher resolution

6.4 Grasp evaluation metrics

  • Success Rate: the ratio of successful grasp executions
  • Percent cleared: the percentage of objects removed during each round
  • Planning time: the time between receiving input and returning grasps

6.5 GS-Net

  • GS-Net treats grasp pose detection in the wild as a two-stage problem
  • Where stage: locations with high graspability
  • How stage: decide grasp parameters, e.g. in-plane rotation, approaching depth, gripper width, grasp score
  • graspness: a novel geometrically based quality for distinguishing graspable area in cluttered scenes
  • graspness can be split into point-wise and view-wise graspness scores
  • Computing the "success rate" relies on force closure and collision checking

6.6 DexGraspNet 2.0

  • DexGraspNet 2.0: Diffusion-based Dexterous Grasp Generation
  • Stage 1: Where to grasp, predict graspness and objectness scores
  • Stage 2: How to grasp, predict residual position, rotation, and finger joint angles
  • To handle the multi-modal grasping pose distribution, the slides state that a diffusion model is used
  • DexGraspNet 2.0 contains 7600 training scenes with 426 million grasps

6.7 Force closure and form closure

  • Force Closure: the forces that the grasp applies at a set of frictional contacts are sufficient to compensate any external wrench applied to the object
  • Physical statement: the positive span of the wrench cones is the entire wrench space
  • Form closure: the rigid body is fully immobilized by rigid stationary fixtures
  • When planning a grasp by a robot hand, force closure is a good minimum requirement
  • Form closure is usually too strict, requiring too many contacts
  • Relation given in the slides: successful grasp $\le$ force closure $\le$ form closure (read $\le$ as "is a weaker requirement than"; form closure is the strictest)

6.8 Datasets

Type | Dataset | Slide notes
Object dataset | ShapeNet / ModelNet / Objaverse-XL | Objaverse-XL: 10M+ 3D Objects
Synthetic grasp dataset | ACRONYM | with grasping annotation
Real grasp dataset | GraspNet-1Billion | with grasping annotation

Annotation pipeline of GraspNet-1Billion: sample grasp points from the point cloud, then sample and evaluate grasp view / in-plane rotation / gripper depth, and finally project into the scene with the object 6D pose and run collision detection.

6.9 Why hand-eye calibration is needed

  • Grasp poses are initially estimated in camera space
  • To execute them, the robot must transform the poses to the robot space
  • So the transformation between the camera and the robot must be known; this is hand-eye calibration
  • Definition: determining the precise geometric relationship (transformation matrix) between a robot's coordinate system and a camera's coordinate system

6.10 Camera model and calibration

  • Agenda: camera model: intrinsics and extrinsics
  • Goal of calibration: estimate camera intrinsics and extrinsics from one or multiple images
  • If the intrinsics $K$ are known or already calibrated, the pose can be further recovered
  • Perspective-n-Point (PnP) is the camera calibration approach given in the slides

6.11 Eye-in-hand vs Eye-to-hand

Setup | Mounting | Relation to solve
Eye-in-hand | camera mounted on the robot | camera coordinate $\to$ end-effector coordinate
Eye-to-hand | camera mounted stationary next to the robot | camera coordinate $\to$ robot base coordinate

6.12 AX = XB

Core equation of hand-eye calibration: $AX = XB$
In the eye-in-hand workflow:
$A$: end-effector pose change
$B$: camera pose change (computed from the marker pose)
$X$: end-effector to camera
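Where $AX = XB$ comes from can be sketched in a few lines. The notation below is an assumption made for this derivation and reuses the $T_{s\to b}$ convention of Chapter 2: $T_{b\to e}^{(i)}$ is the end-effector pose in the robot base frame at robot pose $i$, $T_{c\to t}^{(i)}$ is the calibration target pose in the camera frame (for example from solvePnP), and $X = T_{e\to c}$ is the unknown, fixed end-effector-to-camera transform. The target is fixed in the base frame, so its base-frame pose is the same for any two robot poses $i$ and $j$.

```latex
\begin{aligned}
T_{b\to t} &= T_{b\to e}^{(i)}\, X\, T_{c\to t}^{(i)}
            = T_{b\to e}^{(j)}\, X\, T_{c\to t}^{(j)} \\[4pt]
\Longrightarrow\quad
\underbrace{\bigl(T_{b\to e}^{(j)}\bigr)^{-1}\, T_{b\to e}^{(i)}}_{A}\; X
&= X\; \underbrace{T_{c\to t}^{(j)}\, \bigl(T_{c\to t}^{(i)}\bigr)^{-1}}_{B}
\end{aligned}
```

Each pair of robot poses contributes one such equation, which is one reason the workflow collects many poses with clearly different orientations.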

6.13 Eye-in-hand workflow

  • Rigidly attach the camera to the robot's end-effector
  • Fix a calibration target on a flat surface within workspace
  • Move the robot to 10-30 different poses where camera clearly sees the board
  • At each pose, record the end-effector pose and compute the target-to-camera transformation with solvePnP
  • Best practices: vary the orientation significantly; cover different heights and tilt angles; avoid singular configurations; the board must be fully visible

6.14 Validation

  • Reprojection Error: a low pixel error (usually $<1$ pixel) indicates a successful calibration
  • TCP Touch Test: if the robot accurately touches the point, the calibration is physically verified
  • Validation for eye-to-hand includes a hand-to-pixel test: the virtual dot overlays perfectly with the physical gripper

6.15 Depth sensing problem

Lecture 6 ends with transparent / specular objects as examples of the depth sensing problem.

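Formula summary for this chapter

Force closure: positive span of the wrench cones = entire wrench space

Grasp quality relation: successful grasp $\le$ force closure $\le$ form closure

Hand-eye equation: $AX = XB$

Camera model: intrinsics + extrinsics

PnP setting: recover the pose when the intrinsics and the correspondences are known

Concept check · Lect06
Which of the following statements about force closure and hand-eye calibration is correct?
A. Form closure is usually looser than force closure
B. Hand-eye calibration establishes the precise geometric relationship between the camera coordinate system and the robot coordinate system
C. Eye-to-hand means the camera is mounted on the end-effector
D. AX = XB is used to compute the reward function
Answer: B. Lecture 6 explicitly defines hand-eye calibration; form closure is stricter than force closure; eye-in-hand is the setup with the camera mounted on the robot.
Lect06
Why is hand-eye calibration required for grasping?
Because the vision modules estimate the object and the grasp pose in camera space, while the robot needs poses in robot space, i.e. in a base- or end-effector-related frame, to execute actions. Without the precise geometric relationship between camera and robot, the vision output cannot be reliably mapped to executable robot actions. [not from slides]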

Chapter 7 · Lect07 Policy I

7.1 Basic definition of a policy

  • A policy is an end-to-end mapping: state $\to$ action
  • Stochastic: $a \sim \pi(a|s)$
  • Deterministic: $a = \pi(s)$

7.2 State

  • In Embodied AI, state $s_t$ at time step $t$ includes a complete description of the environment
  • Contains all information needed to predict the future
  • Usually not fully accessible
  • Under true state, the system is Markovian
  • Markov property:$P(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\ldots)=P(s_{t+1}|s_t,a_t)$

7.3 Dynamics model and world model

  • $P(s_{t+1}|s_t,a_t)$ is called the dynamics model or transition model
  • It describes how the world evolves: predict the next state given the current state and action
  • Sometimes people also call it a world model
  • A world model is usually learned, and may additionally come with a reward model $r(s,a)$ and an observation model $P(o|s)$

7.4 MDP

  • MDP: a framework for sequential decision making under uncertainty
  • At each time step: observe the current state $s_t$, take action $a_t\sim\pi(a_t|s_t)$, and the environment transitions to $s_{t+1}\sim p(s_{t+1}|s_t,a_t)$
  • Markov property: the next state depends only on the current state and action

7.5 Observation

  • An observation is what the agent actually receives from its sensors
  • It includes exteroceptive information: vision, depth, tactile sensing, audio
  • It also includes proprioceptive information: joint angles, velocities, torques, motor states
  • Observations are typically partial, noisy and ambiguous
  • The same observation may correspond to different underlying states
  • Observations are often non-Markov, while states are defined to be Markov

7.6 State-based vs observation-based policy

  • In practice, policies usually operate on observations rather than true states
  • General form: a realistic robotic policy can be written as $a\sim\pi(a|o,l)$
  • Here $o$ is the observation and $l$ is a language or task instruction
  • A state-based policy $a\sim\pi(a|s)$ usually exists only in a simulator, or when the state is estimated by other algorithms

7.7 IL vs RL

Method | Prerequisite | Key idea
Imitation Learning | access to an expert | learn to mimic expert behavior directly from data
Reinforcement Learning | no expert available | learn what to do by evaluating the consequences of actions

7.8 Behavior Cloning

  • BC treats policy learning as supervised learning
  • input: observation (or state)
  • output: action
  • For deterministic policy,usually adopt MSE loss
  • Minimizing MSE is equivalent to maximum likelihood under a Gaussian policy with fixed covariance
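The MSE / maximum-likelihood equivalence follows in two lines. The sketch below assumes an isotropic fixed covariance $\sigma^2 I$ and writes the policy mean as $\mu_\theta(o)$; both are notational assumptions, not symbols from the slides. With a general fixed covariance the same argument gives a covariance-weighted squared error instead.

```latex
\pi_\theta(a \mid o) = \mathcal{N}\big(a;\ \mu_\theta(o),\ \sigma^2 I\big)
\quad\Longrightarrow\quad
-\log \pi_\theta(a \mid o) = \frac{1}{2\sigma^2}\,\lVert a - \mu_\theta(o) \rVert^2 + \text{const}

\arg\max_\theta \sum_{t=1}^{T} \log \pi_\theta(a_t \mid o_t)
= \arg\min_\theta \sum_{t=1}^{T} \lVert a_t - \mu_\theta(o_t) \rVert^2
```

Because $\sigma$ is fixed, it only rescales the loss and does not change the minimizer.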

7.9 The core problem of BC: distribution drift / mismatch

At training time the data comes from expert trajectories and only covers states visited by the expert; at test time the data comes from the states visited by the learned policy.
As a result: small mistake $\to$ new unseen state $\to$ larger mistake $\to$ OOD state, so errors accumulate over time.

7.10 Data collection in Embodied AI

  • ALOHA-style master-slave teleoperation: a human operator controls a master arm that is kinematically coupled to a slave robot arm
  • This records observations $o_t$ and actions $a_t$, forming a teleoperation dataset $\mathcal{D} = \{(o_t,a_t)\}_{t=1}^T$
  • Observations can contain visual observation and proprioception
  • Actions can be a joint-space target, a task-space pose, or the very common $\Delta x_t$

7.11 DAgger and HG-DAgger

  • Problem with the original DAgger: for a $6$-DoF/$7$-DoF robotic arm, let alone a humanoid, it is very hard for a human to label an action for every state
  • HG-DAgger: instead of labeling actions offline, the human intervenes during execution
  • Procedure: run the policy, the human monitors execution, the human takes over when the policy makes a mistake, record the corrected data, aggregate and retrain
  • Advantages: avoids manual action labeling; only labels when necessary; more data-efficient and practical

7.12 On-policy distillation

  • Key idea:replace human labeling with a teacher policy
  • Run student policy and collect states visited by the student
  • Query teacher policy for labels
  • Train student for one gradient step (no more than one to maintain pure on-policy)
  • Examples of a teacher policy: a privileged state-based policy, or a motion planner that directly provides action labels
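The loop above (run student, query teacher, one gradient step) can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: a scalar state, a hypothetical teacher $a = -2s$, a linear student $a = w s$, toy dynamics $s \leftarrow s + 0.1a$, and an MSE imitation loss.

```python
import numpy as np

def teacher_policy(state):
    """Hypothetical privileged teacher: a proportional controller a = -2 s."""
    return -2.0 * state

def distill(num_iters=2000, horizon=20, lr=0.5, seed=0):
    """On-policy distillation of the teacher into a linear student a = w * s."""
    rng = np.random.default_rng(seed)
    w = 0.0  # student parameter
    for _ in range(num_iters):
        # 1) Run the *student* and collect the states it visits.
        s = rng.uniform(-1.0, 1.0)
        states = []
        for _ in range(horizon):
            states.append(s)
            a = w * s
            s = float(np.clip(s + 0.1 * a, -1.0, 1.0))  # assumed toy dynamics
        states = np.array(states)
        # 2) Query the teacher for action labels on those states.
        labels = teacher_policy(states)
        # 3) Take exactly one gradient step on the MSE imitation loss,
        #    then discard the batch so the next data is on-policy again.
        grad = np.mean(2.0 * (w * states - labels) * states)
        w -= lr * grad
    return float(w)

if __name__ == "__main__":
    print(distill())  # approaches the teacher gain of -2
```

The key contrast with plain BC is step 1: the training states come from the student's own rollouts rather than from the teacher's trajectories.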

Chapter formula summary

Stochastic policy:$a \sim \pi(a|s)$

Deterministic policy:$a = \pi(s)$

Markov property:$P(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\ldots)=P(s_{t+1}|s_t,a_t)$

Observation-based policy:$\pi_\theta(a_t|o_t)$

General robotic policy:$a \sim \pi(a|o,l)$

Teleoperation dataset:$\mathcal{D}=\{(o_t,a_t)\}_{t=1}^T$

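Concept check · Lect07
Which of the following statements about state, observation and policy is correct?
A. Observations are usually partial, noisy and ambiguous, whereas only the true state is Markov by definition
B. An observation is always more complete than the state
C. A robot policy can only be written as $\pi(a|s)$ and cannot depend on observations
D. In BC, the training and test state distributions are inherently identical
Answer: A. Lecture 7 explicitly distinguishes state from observation; in practice policies are usually observation-based, and the key problem of BC is precisely distribution mismatch.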
Lect07
What is the most typical problem of Behavior Cloning in robotics?
A. The reward is too sparse
B. distribution mismatch / error compounding
C. It cannot handle deterministic policies
D. It only works for state-based policies, not observation-based policies
Answer: B. The slides call it distribution drift / distribution mismatch and describe the chain: small mistake $\to$ new unseen state $\to$ larger mistake $\to$ OOD state, so errors accumulate over time.

Chapter 8 · Lect08 Policy II

8.1 Basic definition of RL

  • RL = Learning a policy by interacting with an environment
  • Objective:maximize cumulative reward
  • No need for expert demonstrations

8.2 Reward function

Reward function $r(s,a) \in \mathbb{R}$ is a scalar signal that evaluates the quality of an action in a state and provides immediate feedback from the environment.
  • Key property:Local & myopic,only reflects instant outcome
  • Can be sparse or dense
  • Sparse:reward only at success/failure,reward may be delayed,requires credit assignment
  • Dense:frequent feedback shaping behavior

8.3 POMDP

  • Lecture 8 builds on the MDP and introduces the POMDP (Partially Observed Markov Decision Process)
  • For a visual policy, $s_t$ is usually unavailable; the agent can only observe $o_t$

8.4 Online RL, on-policy, off-policy, offline policy learning

Concept | Slide wording
Online RL | allows interaction with the environment while doing RL
On-Policy RL | train a policy using experiences collected from the most recent policy
Off-Policy RL | use data collected throughout training and stored in buffer $D$; more sample efficient
Behavior cloning | learning a policy via imitating expert demonstration; no need of reward; not RL
Offline RL | collect data from any policy, store in $D$, then no further interaction; need reward; it is one type of RL

8.5 Monte Carlo approximation

  • Sample $N$ trajectories and approximate an expectation using sample averages
  • Replace intractable expectation with empirical mean
  • This is an application of the Law of Large Numbers
  • The sample average is an unbiased estimator of the true expectation
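A minimal numerical sketch of this idea, under a made-up setup: each hypothetical trajectory has 10 steps and pays reward 1 with probability 0.5 per step, so the true expected return is 5. The estimate is simply the mean return over $N$ sampled trajectories.

```python
import numpy as np

def sample_return(rng, horizon=10):
    """Return of one hypothetical trajectory: reward 1 with probability 0.5 per step."""
    return rng.binomial(1, 0.5, size=horizon).sum()

def monte_carlo_estimate(num_trajectories, seed=0):
    """Replace the expectation E[return] by the empirical mean over N samples."""
    rng = np.random.default_rng(seed)
    returns = [sample_return(rng) for _ in range(num_trajectories)]
    return float(np.mean(returns))

if __name__ == "__main__":
    for n in (10, 100, 10000):
        print(n, monte_carlo_estimate(n))  # fluctuates less around 5 as n grows
```

With few samples the estimate is noisy even though it is unbiased, which is the same effect that makes REINFORCE gradients noisy.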

8.6 REINFORCE 的特性

  • Given a regular size of samples, policy gradient from REINFORCE is very noisy
  • high variance
  • still unbiased

8.7 Policy gradient in POMDP

  • For visual policy, replace $s_t$ by $o_t$
  • Quote from the slides: We can use policy gradient in POMDPs by simply modifying $s_t \to o_t$.
  • The policy becomes $\pi_\theta(a_t|o_t)$

8.8 Causality and reward-to-go

Causality: actions only affect future rewards, not past ones.
Therefore, at each time step $t$, the total return of the whole episode can be replaced by the reward-to-go, which reduces variance.
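Computing the reward-to-go for one trajectory is a single backward pass over the rewards. The helper name `reward_to_go` is hypothetical; the quantity it returns is $G_t = \sum_{t'=t}^{T} r_{t'}$.

```python
def reward_to_go(rewards):
    """Return [G_0, ..., G_{T-1}] where G_t sums the rewards from step t onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0, 3.0]
    print(sum(rewards))           # total return, identical weight for every step
    print(reward_to_go(rewards))  # per-step weights that ignore past rewards
```

The first entry equals the total return, and later entries drop the rewards that were collected before the corresponding action was taken.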

8.9 Baseline

  • Slide conclusion: subtracting a baseline is unbiased in expectation
  • average reward is not the best baseline, but it's pretty good
  • The main purpose of introducing a baseline is to reduce variance
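The unbiasedness claim can be checked with a short derivation. It is written here for a single state $s$ and a baseline $b$ that may depend on $s$ but not on the action $a$; this per-state form is one standard way to present the argument and may differ in notation from the slides.

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b\big]
= b \int \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, da
= b \int \nabla_\theta \pi_\theta(a \mid s)\, da
= b\, \nabla_\theta \int \pi_\theta(a \mid s)\, da
= b\, \nabla_\theta 1 = 0
```

So subtracting $b$ changes the variance of the gradient estimate but not its expectation.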

8.10 Actor-critic

Component | Role
Actor | the policy
Critic | value function
  • Actor-critic uses the critic to estimate the return, thereby reducing the variance of the policy gradient
  • Batch actor-critic:trajectory-based gradient evaluation
  • Online actor-critic:transition-based gradient evaluation

8.11 Discount factor $\gamma$

  • higher $\gamma$ means considering a longer future
  • smaller $\gamma$ focuses more on immediate rewards and transitions
  • The slides later also emphasize: discount = variance reduction

8.12 N-step returns and GAE

  • n-step return: single parameter knob $(n)$ that balances bias and variance by deciding how long you trust the real trajectory before bootstrapping
  • GAE: weighted combination of n-step advantage / n-step returns
  • Intuition from the slides: mostly prefer cutting earlier (less variance), weighted with an exponential falloff
  • Typical choices in on-policy methods:$\gamma \approx 0.99, \lambda \approx 0.95$
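A common way to compute GAE is the backward recursion over TD residuals, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $A_t = \delta_t + \gamma\lambda A_{t+1}$. The sketch below uses that standard recursion; it is not taken from the slides, and the function name and the convention that `values` carries one extra bootstrap entry are assumptions.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: length T.
    values:  length T + 1, i.e. V(s_0), ..., V(s_T); the last entry is the
             bootstrap value (0 if the episode terminated). Assumed convention.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD residual: trust one real reward, then bootstrap.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially decayed sum of later residuals (weight gamma * lam per step).
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

if __name__ == "__main__":
    r = [1.0, 1.0, 1.0]
    v = [0.5, 0.5, 0.5, 0.0]
    print(gae(r, v, gamma=1.0, lam=0.0))  # pure one-step TD residuals
    print(gae(r, v, gamma=1.0, lam=1.0))  # Monte Carlo return minus V(s_t)
```

Setting `lam=0` recovers the one-step advantage (lowest variance, most bias from the critic), while `lam=1` recovers the full reward-to-go minus the baseline; values in between interpolate.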

8.13 RL summary at the end of the slides

  • Actor-critic algorithms: reduce variance of policy gradient
  • Policy evaluation: fitting value function to policy
  • Discount factors: both a temporal horizon and a variance reduction trick
  • Actor-critic design: one network with two heads or two networks; batch-mode or online (+ parallel)
  • State-dependent baselines: another way to use the critic; can be combined with n-step returns or GAE

Chapter formula summary

Reward function:$r(s,a) \in \mathbb{R}$

Policy in POMDP:$\pi_\theta(a_t|o_t)$

Reward-to-go:$G_t = \sum_{t'=t}^{T} r_{t'}$

Q-function: the expectation of the reward-to-go conditioned on $(s_t,a_t)$

State-based baseline:$V(s_t)$

Typical GAE hyperparameters:$\gamma \approx 0.99, \lambda \approx 0.95$

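Concept check · Lect08
Which of the following statements about reward-to-go, baseline and actor-critic is correct?
A. Subtracting a baseline necessarily introduces bias, so it cannot be used
B. The purpose of reward-to-go is to also drop future rewards
C. Actor-critic uses the critic to estimate the value / return, mainly to reduce the variance of the policy gradient
D. The smaller $\gamma$ is, the farther ahead the agent necessarily looks
Answer: C. Lecture 8 stresses that a baseline introduces no bias in expectation, that reward-to-go uses causality and keeps the future rewards, and that one of the core roles of actor-critic is to reduce variance.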
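Worked long-answer question · Why is reward-to-go justified?
In Lect08 the slides use causality: actions only affect future rewards, not past ones. Explain why this allows the total return at each time step to be rewritten as the reward-to-go.
The key point is not that the rewritten estimator is "more accurate", but that it "remains unbiased while having lower variance".

Following the slides, split the total return into two parts: past rewards and future rewards. For the policy gradient term at time step $t$, the current action $a_t$ cannot affect past rewards that have already occurred, so the past part contributes $0$ in expectation.

So the only part actually related to the current action is the return from $t$ onward, i.e. the reward-to-go.

The benefit of this step: it removes past reward terms that are unrelated to the current action but add noise, so the variance of the estimate decreases. Non-slide content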
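Written out, the argument looks as follows. This is a sketch in the usual REINFORCE notation with $N$ sampled trajectories indexed by $i$; the indexing is an assumption and may differ from the slides.

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})
\Big(\sum_{t'=1}^{T} r_{i,t'}\Big)
\;\longrightarrow\;
\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\, G_{i,t},
\qquad G_{i,t} = \sum_{t'=t}^{T} r_{i,t'}

\text{for } t' < t:\quad
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\big]
= \mathbb{E}\Big[ r_{t'}\;
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big] \Big]
= 0
```

The inner expectation vanishes because $r_{t'}$ is already fixed once $s_t$ is reached, and $\int \nabla_\theta \pi_\theta(a \mid s_t)\, da = \nabla_\theta 1 = 0$. Dropping the terms with $t' < t$ therefore leaves the expected gradient unchanged.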

Chapter 11 · Interactive self-test

Introduction to Embodied AI midterm self-test


Chapter 12 · Pre-exam quick reference

Lecture 1-4: Robotics basics

Rigid transform:$p^s = R_{s\to b} p^b + t_{s\to b}$

Homogeneous transform:$T = \begin{bmatrix}R & t \\ 0 & 1\end{bmatrix}$

Composition:$T_{3\to1}=T_{3\to2}T_{2\to1}$

Inverse:$T_{2\to1}=(T_{1\to2})^{-1}$

Forward kinematics:$T_{s\to e}=f(\theta)$

Rodrigues:$e^{[\omega]\theta}=I+[\omega]\sin\theta+[\omega]^2(1-\cos\theta)$

Rotation distance:$\mathrm{dist}(R_1,R_2)=\arccos\left(\frac{\mathrm{tr}(R_2R_1^T)-1}{2}\right)$
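As a quick sanity check of the Rodrigues and rotation-distance formulas, they can be evaluated numerically. This is a minimal NumPy sketch; the function names are hypothetical, and the axis is normalized inside `rodrigues` because the formula assumes a unit axis $\omega$.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w] such that [w] v = w x v."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def rodrigues(axis, theta):
    """R = I + [w] sin(theta) + [w]^2 (1 - cos(theta)) for a unit axis w."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = skew(axis)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rotation_distance(R1, R2):
    """Angle of the relative rotation R2 R1^T, in radians."""
    c = (np.trace(R2 @ R1.T) - 1.0) / 2.0
    # Clip to guard against round-off pushing c slightly outside [-1, 1].
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

if __name__ == "__main__":
    R = rodrigues([0.0, 0.0, 1.0], np.pi / 2)
    print(np.round(R, 3))                   # 90 degree rotation about z
    print(rotation_distance(np.eye(3), R))  # approximately pi / 2
```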

Lecture 4: Control

Tracking error:$x_e = x_{ref} - x$

P control:$u = K_p x_e$

PD control:$u = K_p x_e + K_d \dot{x}_e$

PID control:$u = K_p x_e + K_i \int x_e dt + K_d \dot{x}_e$

$K_p$:increase speed, reduce steady-state error, but may increase overshoot

$K_d$:acts as damping / brake, reduce overshoot and shorten settling time
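The damping role of $K_d$ is easy to see in simulation. The sketch below assumes a hypothetical plant that is not from the slides: a unit point mass with $\ddot{x} = u$, a constant reference (so $\dot{x}_e = -\dot{x}$), and simple Euler integration.

```python
def simulate_pd(kp, kd, x_ref=1.0, dt=0.001, steps=10000):
    """Simulate PD control u = Kp * x_e + Kd * x_e_dot on a unit point mass."""
    x, v = 0.0, 0.0
    history = []
    for _ in range(steps):
        x_e = x_ref - x          # tracking error
        x_e_dot = -v             # reference is constant, so d(x_e)/dt = -v
        u = kp * x_e + kd * x_e_dot
        v += u * dt              # assumed plant: acceleration equals u
        x += v * dt
        history.append(x)
    return history

if __name__ == "__main__":
    low_damping = simulate_pd(kp=100.0, kd=2.0)
    high_damping = simulate_pd(kp=100.0, kd=20.0)
    print(max(low_damping))   # large overshoot past x_ref = 1
    print(max(high_damping))  # essentially no overshoot
    print(high_damping[-1])   # settles at the reference
```

With the same $K_p$, raising $K_d$ removes most of the overshoot, matching the "damping / brake" description above.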

Lecture 5-6: Visual grasping

4-DoF grasp: 3D position + 1D hand orientation aligned with gravity

6-DoF grasp: 3D position + 3D orientation

6D object pose: the 6D transformation from object space to camera space

ICP: point cloud registration; relies on good initialization

Force closure: good minimum requirement for grasp planning

Hand-eye calibration: find the precise geometric relationship between the camera and robot coordinate systems

Hand-eye equation:$AX = XB$

Lecture 7-8: Policy learning

Policy: state / observation $\to$ action

Markov property: $P(s_{t+1}|s_t,a_t,\ldots)=P(s_{t+1}|s_t,a_t)$

Observation: partial, noisy, ambiguous; often non-Markov

BC problem: distribution mismatch / error compounding

Reward: scalar immediate feedback; can be sparse or dense

REINFORCE: high variance, still unbiased

reward-to-go: follows from causality; reduces variance

baseline: subtracting a baseline is unbiased in expectation

Actor-critic: actor = policy, critic = value function

Typical GAE hyperparameters: $\gamma \approx 0.99, \lambda \approx 0.95$

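Final checklist Non-slide content
1. Can you explain the differences between state, observation, action, reward and policy without looking at your notes?
2. Can you write down the homogeneous transform, Rodrigues' formula, $AX=XB$, and P/PD/PID?
3. Can you explain why BC drifts, why reward-to-go reduces variance, and why actor-critic is more stable than pure Monte Carlo?
4. Can you explain why hand-eye calibration is required between "vision output in camera space" and "robot execution in robot space"?
5. Can you verbally connect the whole course: Embodiment $\to$ robotics basics $\to$ grasping $\to$ policy learning?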