DEV Community: Shawn

FutureX · Physical AI Daily — Issue 31 (06/18)

Shawn — Wed, 17 Jun 2026 14:50:15 +0000

Today's Highlights

· Robotaxi global expansion advances on three fronts in a single day: WeRide × Uber launches in Zurich (Europe's second city in two weeks, after Madrid); Stellantis × Wayve × Uber sign a global L4 robotaxi cooperation MoU; Uber × Lucid × Nuro designate Houston as the next city, targeting mid-2027.

· Faraday Future (controlled by Jia Yueting, Chinese EV entrepreneur) unveils a "full-form" embodied robotics lineup across six series, including the $1,990 FX Navi education robot and a new Futurist humanoid, targeting home and K–12 education ecosystems; FFAI shares rose on the news.

· World models continue to attract capital and ship products: Physis (逆矩阵, Chinese world-model startup), founded by Peking University entrepreneur Chen Boyuan (born 2004), closes a seed++ round of more than $100 million, with participation from Matrix Partners China, Wuyuan Capital, and BAI Capital plus a strategic investment from Ant Group; on the same day, Alibaba launches the real-time interactive world model "HappyOyster 1.0" and AutoNavi releases DreamX-World 1.0.

· Mifeng Technology (觅蜂科技, physical AI data platform spun off from Zhiyuan Robotics) raises a further hundred-million-RMB-range Angel+ round led by Guofang Capital, continuing the independent-incubation playbook built on the thesis that "data is the differentiator"; according to OFweek, cumulative embodied-AI funding in China from January to May has reached approximately 96.6 billion RMB.

· On the research side: Alibaba releases the Qwen-Robot technical reports (RobotManip / RobotNav); ACE-Ego-0 unifies human and robot egocentric data for VLA pre-training and open-sources the model (HF↑39).

I. Research

ACE-Ego-0: Unifying Human Egocentric Video and Robot Trajectories for VLA Pre-training · vla

VLA training suffers from expensive, scarce real-robot trajectories, while internet-scale first-person human video offers ready-made "supplementary supervision." The contribution here is genuinely combining two heterogeneous data types — differing in action space, embodiment structure, temporal dynamics, and annotation quality — into a single pre-training framework rather than training them separately, with an open-source release accompanying the paper.

Hao Li et al. (ACE Robotics × CUHK) · arXiv 2606.17200 source · HF↑39

The team builds a scalable "egocentric video → action" pipeline that converts raw human video into pseudo-action trajectories in robot format, then uses unified representations to align action labels from both human and robot data to a comparable scale for joint pre-training. The companion open-source model ACE-Ego was released on the same day jointly by ACE Robotics and CUHK.
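The "comparable scale" step can be pictured with a small sketch. This is not the ACE-Ego-0 pipeline: it assumes, purely for illustration, that alignment is done by per-source z-score normalization of each action dimension, and the function names and the toy metre/millimetre data are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Actions = List[List[float]]  # one row per time step, one column per action dimension


def fit_scaler(actions: Actions) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and standard deviation for one data source."""
    dims = len(actions[0])
    columns = [[row[d] for row in actions] for d in range(dims)]
    mu = [mean(col) for col in columns]
    # Fall back to 1.0 so a constant dimension does not divide by zero.
    sigma = [pstdev(col) or 1.0 for col in columns]
    return mu, sigma


def normalize(actions: Actions, scaler: Tuple[List[float], List[float]]) -> Actions:
    """Map raw actions to zero mean and unit variance per dimension."""
    mu, sigma = scaler
    return [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in actions]


# Hypothetical toy data: human pseudo-actions in metres, robot actions in millimetres.
human_raw = [[0.01], [0.02], [0.03]]
robot_raw = [[10.0], [20.0], [30.0]]

human_aligned = normalize(human_raw, fit_scaler(human_raw))
robot_aligned = normalize(robot_raw, fit_scaler(robot_raw))
```

After normalization the two sources occupy the same numeric range even though their raw units differ by a factor of 1,000, which is the property a joint pre-training loss needs.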

Alibaba Qwen-Robot Technical Reports: Using "Alignment" to Unlock Scalable Robot Foundation Models · vla

Following yesterday's Tongyi "hands-feet-brain" triple release, Alibaba now fills in the methodological details. The core argument: manipulation data is inherently heterogeneous, costly to collect, and narrow in diversity, so simply stacking data causes conflicts. Alignment across representation, motion, and behavior has to come first; only then does multi-source large-scale training "add rather than cancel." This is a key test of whether the formula behind language and multimodal foundation models transfers to robotics.

Haoqi Yuan et al. (Alibaba Tongyi) · arXiv 2606.17846 source (RobotManip) / 2606.18112 (RobotNav) · Commentary: Feynman Bits source (WeChat, CN)

RobotManip, built on Qwen-VL, proposes a unified alignment framework across three dimensions — representation, motion, and behavior — enabling multi-source manipulation data to be jointly trained without mutual interference. RobotNav targets "agent-style navigation systems," providing a scalable navigation backbone whose observation strategy is externally reconfigurable at inference time: instruction following, object search, object tracking, and autonomous driving all share the same perception-planning backbone, but consume visual streams differently; robustness is achieved by randomizing task modes and observation parameters (token budget, per-camera weights) during training. According to a WeChat analysis, Qwen-VLA is effectively a ~5B unified-weight model — roughly a 4B Qwen3 VL backbone with an approximately 1.15B DiT flow-matching action decoder attached.
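The training-time randomization described for RobotNav can be sketched as a sampler that draws a task mode, a token budget, and per-camera weights for each training example. The mode names follow the four tasks listed above; the budget values, camera names, and every function name here are assumptions for illustration, not details from the report.

```python
import random
from dataclasses import dataclass
from typing import Dict, Sequence

TASK_MODES = (
    "instruction_following",
    "object_search",
    "object_tracking",
    "autonomous_driving",
)


@dataclass
class ObservationConfig:
    task_mode: str
    token_budget: int
    camera_weights: Dict[str, float]


def sample_observation_config(
    rng: random.Random,
    cameras: Sequence[str],
    budgets: Sequence[int] = (256, 512, 1024),  # hypothetical budget choices
) -> ObservationConfig:
    """Draw one randomized observation setup for a training example."""
    raw = [rng.random() + 1e-6 for _ in cameras]  # strictly positive weights
    total = sum(raw)
    weights = {cam: r / total for cam, r in zip(cameras, raw)}
    return ObservationConfig(rng.choice(TASK_MODES), rng.choice(budgets), weights)


def tokens_per_camera(cfg: ObservationConfig) -> Dict[str, int]:
    """Split the visual token budget across cameras in proportion to their weights."""
    return {cam: int(cfg.token_budget * w) for cam, w in cfg.camera_weights.items()}


cfg = sample_observation_config(random.Random(0), ["front", "left", "right"])
allocation = tokens_per_camera(cfg)
```

Because the policy sees many such configurations during training, the same backbone can later be handed a different budget or camera weighting at inference time without retraining, which is the reconfigurability the summary describes.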

MuseVLA: Equipping VLAs with On-Demand Multimodal Sensing · vla

Most VLAs consume only RGB and are blind to physical quantities — temperature, sound, radar response — that RGB cannot capture. This paper models the choice of "which sensor to activate and what to attend to" as a tool-call-like action, letting the model decide when to "open a third eye." The approach is more scalable than indiscriminately stacking sensors.

Xingyuming Liu et al. (Peking University / Microsoft et al.) · arXiv 2606.17598 source · Commentary: Non-Embodied Non-Intelligent source (WeChat, CN)

Given a task instruction and visual context, MuseVLA first generates a "sensor token + target description" — equivalent to a parameterized tool call — that decides which sensing modality to invoke and what to attend to; the selected sensor's measurement is then converted into a unified intermediate "sensor image" fed back into the policy. This effectively wires infrared, audio, radar, and other modalities into the manipulation loop as on-demand inputs conditionable on language, beyond RGB.
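A minimal sketch of the "parameterized tool call" idea follows. It assumes a plain registry of sensor readers and a min-max rescaling into a 2D grid as the "sensor image"; the class, the function names, and the fake readings are all hypothetical and say nothing about how MuseVLA actually encodes sensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SensorCall:
    sensor: str  # which modality to activate (the "sensor token")
    target: str  # what to attend to (the "target description")


def to_sensor_image(values: List[float], width: int) -> List[List[float]]:
    """Rescale raw readings to [0, 1] and reshape them into rows of `width` values."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant readings
    scaled = [(v - lo) / span for v in values]
    return [scaled[i:i + width] for i in range(0, len(scaled), width)]


def run_sensor_call(
    call: SensorCall,
    readers: Dict[str, Callable[[str], List[float]]],
    width: int = 2,
) -> List[List[float]]:
    """Dispatch the call to the chosen sensor and return a unified sensor image."""
    if call.sensor not in readers:
        raise KeyError(f"no reader registered for sensor {call.sensor!r}")
    return to_sensor_image(readers[call.sensor](call.target), width)


# Hypothetical readers returning fake measurements for illustration.
readers = {
    "thermal": lambda target: [20.0, 21.0, 80.0, 22.0],
    "audio": lambda target: [0.0, 0.3, 0.1, 0.0],
}
call = SensorCall(sensor="thermal", target="the kettle on the left")
sensor_image = run_sensor_call(call, readers)
```

The point of the unified image format is that the policy consumes every modality through the same input path, so adding a sensor means registering a reader rather than changing the policy's interface.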

EgoInfinity: Automatically Converting Arbitrary Web Video into 4D Hand-Object Interaction Data for Robot Learning · manipulation

Internet video is the largest "reservoir" of human manipulation knowledge, but turning arbitrary RGB clips into trainable robot data has been a persistent bottleneck. EgoInfinity is not another static dataset; it is a continuously operating "engine" for producing data — a higher-leverage contribution for open-world manipulation learning, which has long been bottlenecked by data scarcity.

Gaotian Wang et al. · arXiv 2606.17385 source

EgoInfinity is a modular 4D hand-object interaction data engine that chains perception, segmentation, reconstruction, interaction-aware refinement, and retargeting, converting web video into "arbitrary-viewpoint robot retargeting + video-to-action" training data without any human-in-the-loop annotation. Its modular design also lets it benefit continuously from upstream advances in component models.

Uncertainty Quantification for Flow-Based VLAs: Teaching Policies to Know When They Might Be Wrong · vla

VLA action heads trained with flow matching perform strongly but have almost no mechanism to express "I'm not confident about this step." In non-stationary environments outside the pre-training distribution, the model can fail without warning. This paper provides a deployment-ready failure prediction method.

Ralf Römer et al. · arXiv 2606.18043 source

The authors derive an efficient method for estimating epistemic uncertainty: measuring velocity-field disagreement (VFD) across a small ensemble and using it for failure detection and unreliable-action identification. Compared with adding a heavy Bayesian head to a flow model, VFD incurs low computational overhead and is suitable for real-time control loops as a gate for "should I trust this action step?"
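One way to picture velocity-field disagreement is as the variance of the predicted velocity across ensemble members at the same state and flow time. The sketch below assumes that definition, plus a toy ensemble and a hand-picked threshold; the paper's exact estimator is not given in this summary, so treat every name and number as illustrative.

```python
from statistics import pvariance
from typing import Callable, List, Sequence

# A velocity field maps (state x, flow time t) to a velocity vector.
VelocityField = Callable[[Sequence[float], float], List[float]]


def make_toy_field(bias: float) -> VelocityField:
    """Toy ensemble member: members agree near x = 0 and diverge far from it."""
    def field(x: Sequence[float], t: float) -> List[float]:
        return [(1.0 - 0.5 * t) * (-xi + bias * xi * xi) for xi in x]
    return field


def velocity_field_disagreement(
    fields: Sequence[VelocityField], x: Sequence[float], t: float
) -> float:
    """Mean per-dimension variance of the predicted velocity across the ensemble."""
    predictions = [f(x, t) for f in fields]
    dims = len(predictions[0])
    per_dim = [pvariance([p[d] for p in predictions]) for d in range(dims)]
    return sum(per_dim) / dims


def should_trust(
    fields: Sequence[VelocityField], x: Sequence[float], t: float, threshold: float
) -> bool:
    """Gate an action step: trust it only while the ensemble members agree."""
    return velocity_field_disagreement(fields, x, t) <= threshold


ensemble = [make_toy_field(b) for b in (-0.1, 0.0, 0.1)]
familiar = velocity_field_disagreement(ensemble, [0.1, 0.1], 0.0)
unfamiliar = velocity_field_disagreement(ensemble, [3.0, 3.0], 0.0)
```

In a control loop the gate would run once per action chunk; a rejected step could trigger a fallback such as slowing down or handing control to an operator.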

Looped World Models: Parameter-Shared Recurrent Transformers that Shrink World Models by 100× · world-model

World models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to error accumulation. LoopWM proposes treating "iterative latent depth" as a new scaling axis orthogonal to "building bigger models / adding more data" — a paradigm option for world modeling worth watching.

Hongyuan Adam Lu et al. · arXiv 2606.18208 source · HF↑5 · Commentary: AI Miaomaofang source (WeChat, CN)

LoopWM is presented as the first looped (parameter-shared, recurrent-depth) architecture for world modeling: a single parameter-shared Transformer block iteratively refines latent environment states, adaptively scaling "computational depth" to the complexity of each prediction step. It reportedly achieves up to ~100× parameter efficiency compared with conventional approaches at equivalent quality, with spectral constraints intended to keep rollouts of arbitrary length stable.
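The looping idea can be sketched with a toy shared block applied repeatedly until the latent stops changing. This is not LoopWM's architecture: the block is a single tanh layer rather than a Transformer, and the stability constraint is a simple row-sum rescaling (which makes the update a contraction in the max norm) standing in for the spectral constraints mentioned above. All names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def rescale_rows(W: Matrix, rho: float = 0.9) -> Matrix:
    """Scale W so every absolute row sum is at most rho (< 1)."""
    max_row_sum = max(sum(abs(w) for w in row) for row in W)
    scale = min(1.0, rho / max_row_sum) if max_row_sum > 0 else 1.0
    return [[w * scale for w in row] for row in W]


def shared_block(W: Matrix, z: List[float], u: List[float]) -> List[float]:
    """One pass through the single parameter-shared block: z <- tanh(W z + u)."""
    return [
        math.tanh(sum(w * zj for w, zj in zip(row, z)) + ui)
        for row, ui in zip(W, u)
    ]


def looped_refine(
    W: Matrix, z0: List[float], u: List[float], tol: float = 1e-6, max_loops: int = 200
) -> Tuple[List[float], int]:
    """Reapply the same block until the latent stops changing; return latent and loop count."""
    z = list(z0)
    loops = 0
    for loops in range(1, max_loops + 1):
        z_next = shared_block(W, z, u)
        delta = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if delta < tol:
            break
    return z, loops


W = rescale_rows([[2.0, -1.0], [0.5, 1.5]])
z_easy, loops_easy = looped_refine(W, [0.0, 0.0], [0.0, 0.0])
z_hard, loops_hard = looped_refine(W, [0.0, 0.0], [0.5, -0.3])
```

The parameter count is that of one block no matter how many loops run, while the loop count adapts to the input: the trivial input settles after a single pass and the harder one needs more.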

EBench: Beyond Success Rate — Diagnostic Evaluation for General Mobile Manipulation Policies · benchmark

A single success-rate scalar hides the true capability profile of a policy. EBench decomposes evaluation along two groups of dimensions — capability and generalization — and benchmarks several leading general-purpose policies on a common scale, providing practical value for practitioners choosing between approaches.

Ning Gao et al. · arXiv 2606.18239 source

EBench contains 26 diverse and challenging manipulation tasks annotated across 5 capability dimensions and 4 generalization dimensions, and evaluates models including π₀, π₀.₅, XVLA, and InternVLA-A1. Key finding: models with similar success rates can have drastically different capability profiles — π₀.₅ leads on test success rate and train-test retention; InternVLA-A1 leads on mobile manipulation but collapses on dexterous tasks; XVLA's strong atomic-skill set barely overlaps with the others.
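The "similar success rate, different profile" finding is easy to reproduce with toy numbers. The sketch below uses two made-up policies and three made-up dimension names; none of the values are EBench results, and the helper names are hypothetical.

```python
from statistics import mean
from typing import Dict

Profile = Dict[str, float]  # success rate per evaluation dimension


def overall_success(profile: Profile) -> float:
    """The single scalar that a headline success rate would report."""
    return mean(list(profile.values()))


def profile_gap(a: Profile, b: Profile) -> Profile:
    """Per-dimension difference between two policies (a minus b)."""
    return {dim: a[dim] - b[dim] for dim in a}


def largest_gap_dimension(a: Profile, b: Profile) -> str:
    """Dimension on which the two policies differ the most."""
    gap = profile_gap(a, b)
    return max(gap, key=lambda dim: abs(gap[dim]))


# Illustrative values only, not measurements from the paper.
policy_a = {"mobile": 0.9, "dexterous": 0.2, "long_horizon": 0.7}
policy_b = {"mobile": 0.5, "dexterous": 0.7, "long_horizon": 0.6}

score_a = overall_success(policy_a)
score_b = overall_success(policy_b)
worst_split = largest_gap_dimension(policy_a, policy_b)
```

Both toy policies average the same score, yet they differ by half the scale on one dimension, which is exactly the information a single scalar discards.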

DexLink Hand: A 16-DOF Linkage-Driven Dexterous Hand at 320g and Under $400 · manipulation

Dexterity, compactness, and affordability have long been mutually exclusive — high DOF typically implies complex actuation and transmission that is hard to fit within a human-hand form factor. This hand pushes cost into the low-hundreds-of-dollars range with a form factor amenable to mass production, representing a genuine tooling dividend for dexterous manipulation data collection and scaled research.

Hao Wu et al. · arXiv 2606.17418 source

DexLink Hand integrates 20 joints and 16 independent actuators within a human-hand-sized structure, with all actuation, sensing, and transmission components fully embedded. It uses a hybrid planar-and-spatial linkage mechanism, weighs approximately 320g, and costs under $400 in total, aiming for human-level dexterity with high structural integration and affordability.

Other papers today: CAIP (contrastive visual pre-training that extracts human gestures from egocentric video as end-effector action proxies); ThinkingVLA (interleaved vision-language reasoning, unified autoregressive architecture for prediction + inverse dynamics); PearlVLA (relocating "deep thinking" into VLM latent space, balancing low-latency control with explicit planning); WAM-RL (online interactive reinforcement learning with a world-action model, co-evolving world model and policy); OmniDrive / DRIVE-CHOREO (LLM-orchestrated multi-agent driving world models, multi-view controllable video generation); VERITAS (generator-visual-verifier framework that guides and self-improves general policies at inference time); HumanoidArena (egocentric hierarchical whole-body learning benchmark testing the interface between high-level policy and low-level motion tracker); Damage Adaptation in Seconds (soft/metamaterial robots self-adapting to catastrophic damage within one minute).

Open Source · Tools · Benchmarks

· HRDX Dataset: A large-scale vector HD map dataset covering ~40 hours and 1,400 km of minimally overlapping road segments, with 6-camera surround view + 128-line LiDAR + centimeter-level RTK, paired with precisely aligned aerial orthophotos, 10 categories of vector maps, and 20+ semantic/topological attributes — several times the scale of existing public HD-map datasets (arXiv 2606.17080 source).

· ERQA-Plus: A diagnostic benchmark for embodied AI reasoning, with 1,766 QA pairs anchored to 711 robot-centric images, structured across categories including perception, action, social interaction, navigation-environment, and commonsense consequences — specifically designed to separate "genuine embodied reasoning" from "lucky visual/language shortcuts" (arXiv 2606.17639 source).

· WireCraft: A simulation benchmark for industrial flexible cable (DLO — deformable linear object) manipulation, aligning simulated data, real-robot data, and a unified evaluation protocol — filling the gap left by existing benchmarks that are either hardware-locked or lack industrial fixtures (arXiv 2606.18097 source).

· AnnotateAnything: An annotation framework that automatically converts passive 3D assets into "operable" assets — vision-language reasoning infers object semantics and interaction constraints, then large-scale parallel physics annotation produces executable manipulation labels (arXiv 2606.17446 source).

· DeepInsight: A unified evaluation infrastructure spanning the full Physical AI stack, using a single runtime to host heterogeneous evaluations ranging from single-step foundation model decoding to thousands of physics ticks for whole-body control (a gap of more than three orders of magnitude), enabling cross-layer regression diagnostics (arXiv 2606.17574 source).

II. Funding & Deals

Physis (逆矩阵, Chinese world-model startup) ｜ Seed++ ｜ $100M+ · world-model

Co-investors include Matrix Partners China, Guanghe Ventures, Wuyuan Capital, BAI Capital, and Zhongding Capital, with strategic investment from Ant Group; existing investors Hillhouse Ventures and Peking University-affiliated Yanyuan Capital participated in an oversized follow-on. The company was founded by Peking University entrepreneur Chen Boyuan, born in 2004, and is betting on "general world foundation models" (AI that understands and predicts how the physical world operates) to serve as a cognitive engine for serious industrial, embodied, and physical simulation applications. Funds will primarily go toward pre-training R&D and scaled training infrastructure, with a flagship model planned for release by end of 2026; the founder states the window for this direction has narrowed to 18 months. Also on the same day, Alibaba's "HappyOyster" and AutoNavi's DreamX-World world model products launched (see the Industry section), so the capital and product tracks of world models continue to accelerate in parallel. Source: BAI Capital source (WeChat, CN)

Mifeng Technology (觅蜂科技, spun off from Zhiyuan Robotics) ｜ Angel+ ｜ Several Hundred Million RMB · adjacent

Led by Guofang Capital, with follow-on from Futeng Capital, Shanghai Electric Science Fund, and Yuanqi Innovation; existing investors Junpu Intelligence and DCP VGC continued their oversized participation. The company was established approximately four months ago and has already closed two rounds (the prior seed/angel round included Sequoia China, Baidu Ventures, and Yunfeng Fund). Mifeng is the independent physical AI data services platform spun off from Zhiyuan Robotics (Chinese humanoid robot maker) in February 2026, occupying the standalone data-supply position that opened up once the closed loop of "robot collects data → trains model → improves robot" was broken apart. The spin-off continues Zhiyuan's playbook of turning core business units into independently funded subsidiaries targeting sub-verticals. According to OFweek, cumulative embodied-AI funding in China from January to May has reached approximately 96.6 billion RMB. Source: Robot Foresight source (WeChat, CN)

Jijia Vision (极佳视界, Chinese world-model startup) ｜ Series B2 ｜ 1 Billion RMB · world-model

Raises a 1 billion RMB Series B2 round, following a string of large financings within three months, and launches its first home robot, "Shuguang S1," at the same time. Jijia Vision is pursuing a world model / video generation approach; the round and the new product together mark a step in its extension from foundational models toward consumer home products. Source: InfoQ source

Chengdu Humanoid Robot Innovation Center ｜ New Round ｜ 100M+ RMB · humanoid

Led by Orient Fortune Capital, with Guanghua Open Source and others following; Orient Fortune, an early investor, doubles down again. The round follows a recent strategic investment by a Shudao Group portfolio company and comes as the center has recently secured a major contract in the embodied intelligence space. The pattern of "scenario operator bets → capital follows and oversubscribes" is a typical trajectory for regional humanoid innovation centers that anchor themselves to deployment scenarios in exchange for funding. Source: Chengdu Release source (WeChat, CN)

Noitom Robotics (诺亦腾, Chinese motion-capture company) ｜ Pre-A++ ｜ Several Hundred Million RMB · adjacent

Oversubscribed by multiple leading industrial capital firms and market-rate institutions; investors include the Beijing Artificial Intelligence Industry Investment Fund, the Shanghai Artificial Intelligence Industry Series Fund, Shenzhen Capital Group, and Jianfa Emerging Investment. Noitom's strength is in motion capture, and it is positioning itself in the upstream embodied data collection segment, part of the same thematic thread as several other "no hardware, data/platform only" rounds closing today. Source: Alpha Commune source (WeChat, CN)

III. Commercialization & Deployment

WeRide × Uber Launch Robotaxi Service in Zurich — Europe's Second City · autonomy

Zurich has given the green light for robotaxi operations, and WeRide (Chinese autonomous driving company) and Uber have announced a commercial robotaxi service in the city, making Zurich the second European city in two weeks, after Madrid. The two companies have been operating fully driverless or public robotaxi services in Abu Dhabi, Dubai, Riyadh, and other Middle Eastern cities since late 2024, and are now replicating that experience in developed Western European markets. More than ten financial news outlets covered the announcement the same day, and it is the one genuinely "live and operational" entry in today's global robotaxi expansion wave. Source: WeRide source (WeChat, CN)

Autonomique Semi-Humanoid Robot Enters Production Deployment at Tier-1 Automotive Supplier F&P · industrial

Canadian company Autonomique announces that its semi-humanoid robot + AI has advanced to production deployment at Tier-1 automotive parts supplier F&P Mfg. The company itself characterizes its focus as stable on-line productivity rather than acrobatic showpieces. This is worth noting as a real-world deployment at an automotive Tier-1, though single-site production differs substantially from scaled capacity. Source: The Robot Report source

Sanctuary AI Extends Physical AI Strategy to Industrial Robots, Validates at Tier-1 Automotive Supplier · embodied

Sanctuary AI announces the extension of its physical AI capabilities from humanoids to industrial robots, and reports completing "production-ready" performance validation at a Tier-1 automotive supplier. Note that this constitutes capability validation/demonstration in a production-line environment, and remains a significant distance from mass-production installation and scaled delivery. Source: Business Wire source

Industrial Embodied Inspection Robot CASIVIBOT Ships in Volume to Luoyang · industrial

Reported as China's "first" industrial embodied quality-inspection robot, CASIVIBOT has entered volume delivery and is deployed in Luoyang for industrial quality-inspection applications. ⚠️ Vendor claim: the "first" designation and the volume figure are single-party assertions; this can be read as a signal of embodied AI penetrating the inspection sub-vertical in China. Source: Sohu source

Raymond × Third Wave Automation: Physical AI Scaled Across Forklift Fleets · industrial

The Raymond Corporation, a major forklift manufacturer, partners with Third Wave Automation to roll physical AI out across Raymond's forklift fleet at scale. Warehouse logistics is one of the fastest segments for embodied/autonomous technology to deliver ROI, and a leading forklift OEM bringing in autonomous capability at fleet scale is meaningfully closer to true scale-out than isolated pilot installations. Source: Robotics & Automation News source

IV. Industry Developments

Faraday Future Unveils "Full-Form" Embodied Robot World, $1,990 Education Robot Enters Market · humanoid

Faraday Future (FFAI), controlled by Chinese entrepreneur Jia Yueting, rolls out a "full-form" EAI robot world spanning six product series at once, launching the all-new Futurist full-size humanoid and the FX Navi quadruped robot priced at $1,990, alongside a so-called "three-in-one" embodied robotics education ecosystem targeting schools and home users. FX Navi features 12 joint motors and uses an iOS/Android smartphone as its compute platform, paired with visual programming and STEM curriculum; the Futurist humanoid claims native support for NVIDIA whole-body motion control. Multiple Chinese media outlets simultaneously reported that FF received approximately $70 million in robotics-related financing; FFAI shares rose intraday following the news. ⚠️ Vendor claims: product specifications, the education ecosystem, and share-price movements are based on company announcements and intraday trading. Source: Business Wire source

Robotaxi Alliance Building: Stellantis × Wayve × Uber Team Up; Lucid × Nuro × Uber Target Houston · autonomy

On June 17, Stellantis, Wayve, and Uber announced a cooperation MoU for the development and deployment of global L4 autonomous robotaxi services: Stellantis contributes L4-ready vehicle platforms (with embedded sensor suites and safety redundancy designed for high-utilization driverless operations); Wayve contributes the AI driving software; Uber contributes its global mobility network. The goal is coverage across Europe, North America, and additional cities, building on Wayve and Uber's existing deployment plans across more than ten cities including London and Tokyo. On the same day, Uber partnered with Lucid and Nuro to designate Houston as the next robotaxi city, targeting mid-2027. Combined with the actual Zurich launch (see above), the "OEM + autonomous software + mobility platform" three-way bundle is emerging as the dominant organizational model for robotaxi scale-out. Source: Stellantis source

Alibaba "HappyOyster" and AutoNavi DreamX-World Launch Same Day, World Models Ship as Products · world-model

Alibaba Cloud launches HappyOyster 1.0, an open-world model product that constructs and enables real-time interaction with a generated environment. It is claimed to produce interactable digital worlds from a single sentence and to infer "action → feedback" causal chains while maintaining long-range consistency in characters and environment. On the same day, AutoNavi (Alibaba's mapping and navigation subsidiary) releases DreamX-World 1.0, positioned as a general-purpose, multimodal, interactive video world model that integrates camera navigation, long-range scene memory, and composable event control. Continuing the world-model momentum since the recent BAAI Conference, today's focus shifts from "definitional debates at conferences" to large platforms shipping world models as usable products. ⚠️ Vendor claims. Source: AutoNavi Tech et al. source (WeChat, CN)

Dobot to Launch Next-Generation Companion AI Humanoid, "First Listed Cobot Maker" Enters Home Market · humanoid

Dobot (越疆, Chinese collaborative robot maker) previews an upcoming next-generation companion AI humanoid robot, bringing the "perception-reasoning-actuation" closed loop (previously validated in open commercial settings) into the home, and positioning itself as defining a new standard for home embodied intelligence. As the company billed as the "first listed collaborative robot maker" pivots from industrial arms to consumer companion humanoids, it adds another example of cobot manufacturers extending into home scenarios; claims such as "AI can now understand the physical world" reflect marketing language. Source: Sina Finance source

Unitree Opens First Asia Store: Robot Dogs Outsell Humanoids, Diverging Retail Strategies with Zhiyuan · humanoid

Unitree (宇树, Chinese robot maker) opens its first Asia store on a flagship commercial street in Shanghai, using the prime retail location as a brand showcase and direct sales channel and trading top-district foot traffic for consumer visibility. Zhiyuan Robotics (智元, Chinese humanoid robot maker), also in Shanghai, has chosen a quieter location on Caobao Road, focused primarily on receiving enterprise clients, research teams, and potential partners. On-the-ground feedback indicates robot dogs are selling better than humanoids; Unitree's biggest buyers remain institutional customers that want "one humanoid for display purposes." The two companies' contrasting site selections reflect divergent pre-IPO bets on the consumer-brand versus enterprise-sales pathways. Source: Kechuang Daily source (WeChat, CN)

Big-Tech Executive "Next Stop: Embodied AI" — Qwen Lead Lin Junyang Reportedly Enters the Space · embodied

Chinese media report that Lin Junyang, head of Alibaba's Qwen (Chinese large language model series) team, is joining an embodied intelligence startup valued at approximately 13.5 billion RMB, with the focus most likely on the "brain" model layer rather than physical hardware manufacturing. ⚠️ Single-source report: the news has not been officially confirmed; the valuation and positioning are media characterizations. This can be read as one data point in the trend of large-tech executives moving into embodied AI, not a settled conclusion. Source: Guanwang Finance source (WeChat, CN)

Hardware · Supply Chain

· Star Dynamics XHAND 1 PRO (星动纪元, Chinese dexterous hand maker): Launches a 21-DOF fully direct-drive dexterous hand; an independent side-palm degree of freedom improves little-finger opposition precision; maximum finger spread angle 135°, maximum envelope grip diameter over 160mm, capable of grasping large objects such as beer mugs and basketballs. Star Dynamics is an early commercial representative of fully direct-drive five-finger dexterous hands. source (WeChat, CN)

· Boya Intelligence "Gaoshan S1" (伯牙智能, Chinese dexterous hand startup): Founded by former Alibaba executive Liu Xin together with professors from USTC and Harbin Institute of Technology; the company has signed its headquarters into Suzhou AI Industrial Park and launched a new dexterous hand product, backed by a lead investment from Suzhou state capital. Dexterous hands account for approximately 14%–20% of a humanoid robot's total cost, making them one of the highest-value core components. source (WeChat, CN)

· Dexterous Hand Micro-Motor Localization in China: LinkHand 2.0 (灵心巧手, Chinese dexterous hand maker) has switched 60% of its micro-motors to supply from Zhaowei Electromechanical (兆威机电, Chinese micro-motor maker), reducing per-unit cost by approximately 1,800 RMB and cutting total hand cost by roughly 22%; miniature servo motors are becoming a new supply chain bottleneck. source (WeChat, CN)

· Dexterous Hand Demand and Capacity: Industry sources indicate demand for dexterous hands in China this year is up roughly 10× year-on-year, with total volume of approximately 200,000–300,000 units; high-DOF variants (16 DoF and above) account for roughly 10%. LinkHand, as a manufacturer producing thousands of units per month, plans to deliver 50,000–100,000 units in 2026 — already exceeding the approximately 18,000 humanoid robots shipped globally in all of 2025. source (WeChat, CN)
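The demand and capacity figures in the last item can be cross-checked with a few lines of arithmetic. The inputs below are the numbers quoted above; the derived values (implied high-DOF volume and the ratio of planned hand deliveries to 2025 humanoid shipments) are simple inferences from them, not figures from the source, and the variable names are arbitrary.

```python
# Inputs quoted in the item above (approximate industry estimates).
demand_units = (200_000, 300_000)         # dexterous hands demanded in China this year
high_dof_share_pct = 10                   # share with 16 DoF and above, in percent
linkhand_plan_units = (50_000, 100_000)   # LinkHand planned 2026 deliveries
humanoids_shipped_2025 = 18_000           # humanoid robots shipped globally in 2025

# Implied high-DOF hand volume (integer arithmetic avoids float rounding).
high_dof_units = tuple(units * high_dof_share_pct // 100 for units in demand_units)

# Planned hand deliveries relative to 2025 humanoid shipments.
plan_vs_humanoids = tuple(
    round(units / humanoids_shipped_2025, 1) for units in linkhand_plan_units
)

print(f"Implied high-DOF demand: {high_dof_units[0]:,} to {high_dof_units[1]:,} units")
print(f"LinkHand plan vs. 2025 humanoids: {plan_vs_humanoids[0]}x to {plan_vs_humanoids[1]}x")
```

Note that the second comparison sets hands against whole robots, so it describes relative volume rather than a like-for-like market size.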

FutureX · Physical AI Daily — Issue 30 (06/17)

Shawn — Tue, 16 Jun 2026 14:57:42 +0000

Today's Highlights

· Alibaba launched Qwen-Robot, its first embodied large model family under the Qwen line, releasing three components simultaneously: RobotManip (manipulation), RobotNav (navigation), and RobotWorld (world model). The release marks Alibaba's formal entry into physical AI.

· Galaxea AI (Chinese embodied AI startup) unveiled a full embodied AI suite the same day: open-sourcing next-generation VLA backbone G0.5, announcing world model Fast-WAM and a whole-body control backbone, and debuting its self-developed biped humanoid Kengo.

· Genesis AI (backed by Eric Schmidt and others) launched its first general-purpose robot Eno, betting on a "non-humanoid" wheeled form factor, and announced an enterprise deployment partnership with LG CNS, with delivery planned by year-end.

· Mobileye announced plans to launch a vertically integrated, self-operated Robotaxi service in the U.S. in 2027, pivoting from supplier to operator; MBLY shares rose in pre-market trading.

· On the capital side: Simple AI (Chinese embodied AI startup) raised hundreds of millions of RMB in a Pre-A round led by Didi; Unitree Robotics (Chinese quadruped and humanoid robot maker) passed its STAR Market IPO review, targeting approximately RMB 4.202 billion in proceeds.

I. Research Papers

GAM: Repurposing a Geometric Foundation Model as a Shared Backbone for Robot Policies · manipulation

Most current VLA and world-action models still operate in 2D image space or derived latent spaces, lacking the 3D geometry that contact-rich manipulation truly requires. GAM directly reuses a pretrained geometric foundation model to inject 3D structure into policies — the highest-trending paper (HF↑80) that day on the "giving VLAs 3D" track.

Jisang Han et al. · arXiv 2606.17046 source

The method uses a pretrained Geometric Foundation Model (GFM) as a shared backbone for perception, temporal prediction, and action decoding: the GFM is split at an intermediate layer, with shallow layers serving as an observation encoder and deeper layers handling temporal prediction and action output. This gives a language-conditioned manipulation policy native 3D geometry rather than relying on 2D frames or 2D-derived representations. The paper argues that "reusing geometric priors" this way is more efficient than stacking additional point-cloud or depth branches, and better supports generalization across viewpoints and contact tasks.
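The "split at an intermediate layer" design can be sketched generically. The code below treats the backbone as an ordered list of layer functions and returns the shallow part as an encoder and the deep part as the prediction and action head; it is a structural illustration only, with stand-in arithmetic layers and a hypothetical function name, and does not reflect GAM's actual layers or split point.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def split_backbone(
    layers: List[Layer], split_index: int
) -> Tuple[Callable[[float], float], Callable[[float], float]]:
    """Split one pretrained stack into an observation encoder and a deeper head."""
    if not 0 < split_index < len(layers):
        raise ValueError("split_index must leave at least one layer on each side")
    shallow, deep = layers[:split_index], layers[split_index:]

    def encode(observation: float) -> float:
        # Shallow layers: observation encoder.
        for layer in shallow:
            observation = layer(observation)
        return observation

    def predict_and_act(features: float) -> float:
        # Deeper layers: temporal prediction and action output.
        for layer in deep:
            features = layer(features)
        return features

    return encode, predict_and_act


# Stand-in "layers" so the example runs; a real backbone would use network blocks.
backbone: List[Layer] = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * 10.0,
]
encode, predict_and_act = split_backbone(backbone, split_index=2)
features = encode(1.0)
action = predict_and_act(features)
```

Because both halves come from the same pretrained stack, composing them reproduces the original forward pass, so the split adds an interface without adding parameters.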

DreamX-World 1.0: A General Interactive World Model with Long-Horizon and "Revisit" Support · world-model

Among general interactive world models, it is rare to combine revisiting previously observed regions, promptable events, and long-horizon generation in a single model that also spans realistic, gaming, and stylized domains; community attention at HF↑70.

DreamX Team · arXiv 2606.16993 source

This is a general text/image-to-video interactive world model supporting camera navigation, revisiting previously observed regions, and promptable events across realistic, gaming, and stylized scenes. Its data engine combines camera-accurate Unreal Engine rendering, action-rich game recordings, and real-world video with recovered camera geometry. The model uses a lightweight projected positional encoding (E-PRoPE) to preserve projected camera geometry, and converts a bidirectional video generator into a few-step autoregressive world model via causal forcing, DMD distillation, and long-rollout training.

Qwen-RobotWorld: An Embodied Video World Model with Language as the Unified Action Interface · world-model

The "brain" component of Alibaba's Qwen-Robot embodied suite: uses natural language as a unified action interface and unifies manipulation, driving, indoor navigation, and human-to-robot transfer into a single model capable of predicting future visual states.

Jie Zhang et al. (Alibaba Qwen) · arXiv 2606.17030 source

The model uses natural language as a unified action interface, predicting physically plausible future visual trajectories from current observations across robot manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. It offers three application directions: synthetic data augmentation for policy training, a scalable virtual environment for policy evaluation, and language-guided planning signals. Architecturally, it uses a Double-Stream MMDiT that couples frozen Qwen2.5-VL semantics with video-VAE latents, using an MLLM for action encoding.

Kairos: A "Native" World Model Stack for Physical AI · world-model

Advances world models from "passive video generators" toward operational infrastructure capable of maintaining state over long horizons and pretraining across embodiments; the Kairos stack is reported to have topped embodied benchmarks including RoboTwin 2.0 and LIBERO-Plus.

Kairos Team et al. · arXiv 2606.16533 source · Commentary: Quantum Bit source (WeChat, CN)

Kairos proposes a native world model stack: a "native pretraining paradigm" combined with a "cross-embodiment data curriculum" that organizes open-world video, human behavioral data, and robot interaction into a progressive development path. Within a unified native architecture, it simultaneously handles world understanding, generation, and prediction, using hybrid linear temporal attention to maintain persistent state over long horizons, with an emphasis on efficient execution under real deployment constraints. This stack is the world-model approach recently showcased publicly by SenseTime-affiliated Daxiao Robotics (大晓).

How Should World Models Be Evaluated? A Decision-Centric Position Paper · world-model

In line with recent discussions about the confused definition of "world model," this paper directly targets the most critical flaw in current evaluation: claims about "what a model can be used for" routinely exceed what the evaluations actually support.

Yang Yu et al. (Nanjing University) · arXiv 2606.15032 source

The authors survey the various objects now referred to as "world models" (action-conditioned environment models, latent imagination models, future video predictors, interactive neural simulators, latent predictive representations, synthetic data engines, etc.) and argue that evaluation has generalized along with the terminology, causing recurring "claim/evidence mismatches." The paper advocates recentering world model evaluation around downstream decision-making — executable metrics such as planning success rate and policy improvement.
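
As a rough illustration of what such "executable" decision-centric metrics could look like, the sketch below computes a planning success rate and a policy-improvement delta from rollout outcomes. The `Rollout` record, its field names, and the episode outcomes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rollout:
    """One evaluation episode (hypothetical record, not from the paper)."""
    used_world_model: bool  # True if the planner/policy relied on the world model
    success: bool           # did the episode reach the task goal?


def success_rate(rollouts: Iterable[Rollout]) -> float:
    """Fraction of successful episodes; 0.0 for an empty set."""
    rollouts = list(rollouts)
    if not rollouts:
        return 0.0
    return sum(r.success for r in rollouts) / len(rollouts)


def policy_improvement(rollouts: Iterable[Rollout]) -> float:
    """Success-rate gain of world-model-assisted episodes over the baseline."""
    rollouts = list(rollouts)
    with_wm = [r for r in rollouts if r.used_world_model]
    baseline = [r for r in rollouts if not r.used_world_model]
    return success_rate(with_wm) - success_rate(baseline)


# Made-up outcomes: 3 of 4 successes with the world model, 1 of 4 without.
episodes = (
    [Rollout(True, s) for s in (True, True, True, False)]
    + [Rollout(False, s) for s in (True, False, False, False)]
)
planning_success = success_rate(r for r in episodes if r.used_world_model)
print(f"planning success rate: {planning_success:.2f}")
print(f"policy improvement:    {policy_improvement(episodes):+.2f}")
```

The point of the sketch is only that both numbers are computed from executed decisions rather than from visual fidelity scores.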

ViTaL: Integrating Touch into Inference-Time Policy Guidance · manipulation

Inference-time guidance (selecting actions at deployment without retraining the policy) has previously relied almost entirely on vision, but the success of contact-rich tasks often hinges on local interactions such as contact forces. ViTaL incorporates touch into a two-level guidance scheme: high-level mode selection via vision, low-level action refinement via touch.

Yilin Wu et al. (CMU) · arXiv 2606.14981 source

The method frames multimodal guidance as a bilevel optimization: the high level uses visual sample-and-verify for long-horizon mode selection, deciding what behavior the robot should execute; the low level uses tactile-guided diffusion editing to refine the selected action sequence over a shorter horizon to satisfy local constraints such as contact forces. Designed for contact-rich manipulation where vision alone is insufficient to judge success.
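
The two-level structure can be sketched in a few lines, under heavy simplification. This is not the authors' implementation: the visual verifier is a lookup table, the tactile-guided diffusion editing is replaced by a simple step-back rule, and the function names, the linear force model, and all numbers are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Actions = List[float]  # toy action sequence: commanded press depths in mm


def select_mode(candidates: Dict[str, Actions],
                visual_score: Callable[[str, Actions], float]) -> str:
    """High level: sample-and-verify, keep the mode the visual verifier scores highest."""
    return max(candidates, key=lambda name: visual_score(name, candidates[name]))


def refine_with_touch(actions: Actions,
                      sensed_force: Callable[[float], float],
                      max_force: float,
                      step: float = 0.25,
                      iters: int = 100) -> Actions:
    """Low level: back off each action until the sensed contact force is within the limit.

    A stand-in for tactile-guided diffusion editing, which is not reproduced here.
    """
    refined = list(actions)
    for _ in range(iters):
        violations = [i for i, a in enumerate(refined) if sensed_force(a) > max_force]
        if not violations:
            break
        for i in violations:
            refined[i] -= step
    return refined


# Hypothetical candidates and verifier scores.
candidates = {"push": [1.0, 1.0, 1.0], "press": [1.0, 2.0, 1.8]}
verifier_scores = {"push": 0.4, "press": 0.9}
toy_force = lambda depth: 2.0 * depth  # assumed linear contact model, N per mm

mode = select_mode(candidates, lambda name, actions: verifier_scores[name])
refined = refine_with_touch(candidates[mode], toy_force, max_force=3.0)
print(mode, refined)
```

The high level decides what to do from vision alone; the low level only touches the actions whose sensed force violates the local constraint.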

ROVE: Post-Training Humanoid VLAs with "Imperfect Human Intervention" · vla

Human intervention is an important correction signal for VLA post-training, but humanoid whole-body kinematics and dexterous hands make intervention trajectories often hesitant, inefficient, or erroneous; learning from these directly as expert demonstrations would bake in bad habits.

Wei Xiao et al. · arXiv 2606.17011 source · Signal HF↑6

ROVE presents a reinforcement learning framework: first, a human-in-the-loop pipeline collects deployment and intervention data for humanoid manipulation; then an optimism-based method performs humanoid VLA post-training on data containing imperfect interventions, avoiding the absorption of hesitant, inefficient, or erroneous intervention behavior into the policy as supervision.

Other papers today: MotionVLA (VLA for humanoid locomotion, frequency-domain analysis for custom multi-codebook, arXiv 2606.15142 source); Metis (general world-action model for autonomous driving/urban navigation, decoupling video generation and action prediction, 2606.15869); TruDi (trust-region diffusion policy supporting large-scale parallel on-policy RL, 2606.15260); AVA-VLA (VLA with latent-variable reasoning and early-exit mechanism, 2606.15099); CausalDrive (real-time causal driving world model, operating from a single frame + ego trajectory + text prompt, 2606.15341); Retrieve, Don't Retrain (using retrieval instead of per-task fine-tuning to scale VLAs to new tasks at test time, 2606.15631).

Open Source · Tools · Benchmarks

· ATOM-Bench: A real-world manipulation benchmark that decomposes tabletop manipulation into "action atoms × instruction atoms" (30 atomic tasks + 24 compositional generalization tasks, single-arm/dual-arm dual track), releasing 3,000 human demonstrations and evaluation rollout data, designed to diagnose the "can do individual skills but can't recombine them" failure mode in general manipulation policies (arXiv 2606.16826 source).

· Junpu Intelligent × Boden × Shanghai Jiao Tong University: Jointly released a large-scale dataset for real-robot reinforcement learning, targeting the persistent data scarcity in physical-robot RL source.

· AgiBot GO-2 (AgiBot, Chinese humanoid robotics company): Open-sourced as the official baseline model for the AGIBOT WORLD CHALLENGE, available for secondary development by developers worldwide source (WeChat, CN).

II. Funding & Deals

Simple AI (深朴智能) ｜ Pre-A ｜ Hundreds of Millions of RMB · embodied

Led by Didi, with follow-on participation from Meihua Ventures, Keli Sensing, and continued investment from existing backers Creation Partners, Linear Capital, and Puhua Capital. Simple AI focuses on general embodied intelligence robots; the company reportedly closed four rounds within a year, making this the most-watched funding round in China's embodied AI space that day. Source: Zhidongxi source

Unitree Robotics (宇树科技) ｜ STAR Market IPO Approved ｜ ~RMB 4.202 Billion Targeted · humanoid

Unitree Robotics successfully passed its STAR Market IPO review, targeting approximately RMB 4.202 billion in proceeds — a further step toward capital markets following mass production of quadruped robot dogs and humanoid robots. Multiple outlets described the review timeline as "73 days to approval." Source: Sina Finance source

Limitless Labs ｜ Series A ｜ $20 Million · industrial

Building physical AI foundation models and factory-focused AI agents for precision manufacturing; this round will be used to expand the company's embodied foundation model. Targeting manufacturing verticals with an embodied AI backbone is another example of the B2B route. Source: SiliconANGLE source

Pegasus Tech Ventures × CYBERDYNE ｜ Corporate Venture Fund ｜ ¥10 Billion · adjacent

The two parties established a ¥10 billion CVC fund dedicated to "HCPS Cybernics × Physical AI." CYBERDYNE is known for its HAL exoskeleton; the fund aims to accelerate startups in human-cyber-physical systems. Source: Business Wire source

NOITOM Robotics (诺亦腾, Chinese motion-capture and robotics company) ｜ Pre-A++ · adjacent

Investors include the Beijing Artificial Intelligence Industry Investment Fund, the Shanghai AI Industry Series Fund, Shenzhen Capital Group, Jianfa Emerging, CICC Capital, Kunlun Capital, and Yuanhe Puhua, with existing shareholders adding to their positions. NOITOM began in motion capture and is expanding into embodied data collection. Source: Shicheng Capital source (WeChat, CN)

Qiankong Embodied Intelligence (潜空间具身智能) ｜ Seed Round ｜ Tens of Millions of RMB · embodied

Led by Hanhui Capital alongside industry partners; the company focuses on "software-defined robot motion" — enabling robots to move via a no-code platform. Proceeds will fund platform R&D and team expansion. Source: PKU Youth CEO Club source (WeChat, CN)

Hot Money Returns to Collaborative Robots: FAIR-Innovation, Elite Robot, Realman, Flexiv, and Tianjee Intelligence Close Rounds in Quick Succession · industrial

Into 2026, FAIR-Innovation Robotics, Elite Robot, Realman Robotics, and Flexiv each announced new funding rounds, with Tianjee Intelligence (Chinese collaborative robot maker) standing out with a RMB 1 billion Series B. Collaborative robots' relatively contained costs and well-defined deployment scenarios have made them a renewed focus for capital. Source: Gaogong Robot source (WeChat, CN)

III. Commercial Deployment

AgiBot Cumulative Shipments Surpass 10,000 Units; ~RMB 5 Billion National Special-Purpose Robot Base Enters Construction Sprint · humanoid

AgiBot (Chinese humanoid robotics company) announced cumulative shipments exceeding 10,000 units (approximately 4,900 added in the first half of the year — a near-doubling in six months), making it the first player in the industry to reach "10,000-unit delivery," while targeting RMB 10 billion in revenue by 2027. Its approximately RMB 5 billion national special-purpose robot base is entering its construction sprint, with capacity being built out across Lingang, Chengdu Pidu, Zhengzhou, and other locations. (Earlier demonstrations of the Yuanzheng A3 autonomously playing table tennis have been covered previously and are not repeated here.) Source: Yuandong Robot source (WeChat, CN)

Chengdu Humanoid Robot Innovation Center Signs Order for 5,000 Units · humanoid

The Chengdu Humanoid Robot Innovation Center announced it has signed a supply order for 5,000 units, signaling a move from showcase to volume procurement for humanoid robots. ⚠️ Single-party claim ("largest single supplier order in China's embodied AI sector" is a self-reported figure). Source: Sohu source

BMW Tests Humanoid Robots at Leipzig Factory · industrial

BMW is testing humanoid robots at its Leipzig factory in Germany, framing them as "tireless colleagues," joining other automakers trialing humanoids on production lines. The deployment remains at the factory pilot stage, not yet scaled. Source: VISION mobility source

Humanoid-Operated Convenience Store to Launch in Hong Kong · humanoid

According to foreign media reports, a convenience store operated by humanoid robots is set to launch in Hong Kong, extending humanoids from factory floors and guided tours into retail front-of-house. Actual operational performance and human-robot task division remain to be observed. Source: People.com source

Amazon to Build £500 Million Cross-Dock Hub in the UK, Processing ~20 Million Parcels per Week · industrial

Amazon plans to build a cross-dock center in the UK with an investment of approximately £500 million, processing roughly 20 million parcels per week, continuing the expansion of warehouse and logistics automation toward ever-higher throughput. Source: Trans.INFO source

IV. Industry Developments

Alibaba Launches First Embodied Large Model Series Qwen-Robot: "Hands, Feet, and Brain" Released Simultaneously · world-model

The Qwen family's first complete embodied intelligence model series consists of three components — Qwen-RobotManip (manipulation, VLA), Qwen-RobotNav (navigation), and Qwen-RobotWorld (world model) — corresponding to the robot's "hands, feet, and brain," marking Alibaba's formal move from conversational AI in the digital world into the physical world. According to public analyses, the VLA component is built on a Qwen3.5-4B vision-language backbone paired with an approximately 1.15B DiT flow-matching action decoder, unifying manipulation, navigation, and trajectory prediction into a single action-trajectory prediction framework conditioned on "embodied perception prompts," and claimed to be reusable across tasks, environments, and embodiments. Alibaba positions it as a "standardized control backbone" for robotics. The announcement was covered by more than seven English-language outlets the same day; however, Alibaba's Hong Kong-listed shares briefly dipped in pre-market trading, reflecting market uncertainty about the commercial monetization of such a "backbone." Source: Robot Outlook source (WeChat, CN)

Galaxea AI Unveils Full Embodied AI Suite: Open-Sources G0.5, Announces Fast-WAM, Debuts Biped Humanoid Kengo · humanoid

At an embodied intelligence event on June 16, Galaxea AI simultaneously released and open-sourced its next-generation VLA foundation model G0.5, announced its world model Fast-WAM and a whole-body control foundation model, and gave the first public showcase of its self-developed biped humanoid Kengo, while declaring the completion of a "full-stack hardware + intelligence" strategic loop. The founder outlined a three-stage framework: "instinctive intelligence — operational intelligence — evolutionary intelligence," in which instinctive intelligence acts directly on the body (balance, walking, running), with operational and evolutionary intelligence layered on progressively, emphasizing there are no shortcuts. Simultaneously showcasing an open-source backbone model, a world model, and a self-developed humanoid is how Galaxea AI differentiates itself from players that focus solely on "the brain" or solely on "the body." Source: Robot Outlook source (WeChat, CN)

Genesis AI Launches First General-Purpose Robot Eno, Betting on a "Non-Humanoid" Wheeled Approach · embodied

Genesis AI, backed by Eric Schmidt and others, launched its first general-purpose robot Eno, using a wheeled rather than bipedal design, with an integrated hardware body and an AI brain called GENE. The company plans to deliver to customers by year-end. Genesis AI publicly questions the current humanoid hype, arguing that a more pragmatic wheeled mobile manipulation approach is better suited for enterprise scenarios. The same day, it announced what it called an "industry-first" strategic partnership with LG CNS for scaled enterprise robot deployment. AFP, Reuters, and Forbes followed closely; Forbes went as far as asking whether this is "the iPhone moment for humanoid robots" — a media framing more than a validated verdict; Eno's actual capabilities and delivery remain to be proven. Source: The Robot Report source

Current Robotics Releases Whole-Body Dexterous Manipulation Model Curr-0: Unified Weights for Mobile and Fine Manipulation · embodied

Curr-0 uses a single policy to couple walking, whole-body posture coordination, and fine hand manipulation end to end, so the robot does not need to "walk into position and then operate" but instead coordinates the whole body and hands in real time while moving. The model was trained on 21,000 hours of real human behavioral data (including 2,800 hours of whole-body teleoperation), collected via the company's self-developed HumanEx whole-body exoskeleton system: humans wear the exoskeleton and complete tasks naturally in real environments, which shifts data growth from "robot deployment hours" to "human task hours." The official demo shows the robot tearing open a tea bag, lighting incense, stamping a seal, and squatting to place a stuffed toy in a basket; these are capability demonstrations that remain some distance from scaled deployment. Source: Quantum Bit source (WeChat, CN)

Mobileye Announces Self-Operated, Vertically Integrated U.S. Robotaxi Service for 2027 · autonomy

Autonomous driving technology supplier Mobileye (MBLY) announced plans to launch its own vertically integrated Robotaxi service in a U.S. city in 2027, extending its role from supplier to operator — news that drove its U.S. shares higher in pre-market trading. This represents a significant departure from the company's long-standing position as an ADAS and autonomous driving technology supplier, and is being read as adding another integrated player alongside Waymo and Tesla; the 2027 timeline is a stated target. Source: CNBC source

Li Auto Launches Self-Developed Chip Mach M100; CEO Li Xiang Defines "Embodied Intelligent Vehicle" · autonomy

Li Auto (Chinese EV maker) released its self-developed chip Mach M100 and defined the "embodied intelligent vehicle" as an agent combining four capabilities: an electric vehicle, a professional driver, an AI computer, and a life assistant. CEO Li Xiang stated that smart cars are "not yet smart enough" and set a target for the Mach VLA to match Tesla FSD V14 by year-end. This is another example of an automaker extending the "embodied intelligence" narrative from robot bodies back to the car itself; the year-end FSD comparison is a stated target requiring verification against actual software versions. Source: Sina Finance source

JD.com Establishes Embodied Intelligence Research Institute in Suqian, Continuing to "Anchor to the Physical World" · embodied

JD.com announced the establishment of an Embodied Intelligence Research Institute in Suqian, extending its stated strategy of "AI value reconstruction anchored to the physical world," as the e-commerce and logistics giant continues directing resources toward embodied AI. Source: Gasgoo source

ABB Robotics Partners with PSYONIC to Close the Dexterity Gap Using Human-Generated Data · embodied

ABB Robotics has partnered with PSYONIC, a bionic prosthetic hand company, to leverage the human manipulation data accumulated through PSYONIC's bionic prosthetic hands to improve robot fine manipulation and tactile dexterity — echoing the day's broader industry discussion that "dexterous hands are a bottleneck for embodied AI." Source: The Robot Report source

Hardware · Supply Chain

· Dexterous Hand Cost and Lifespan: The per-unit cost of dexterous hands in the industry still exceeds $6,000, with a usable lifespan of approximately six weeks — identified as one of the key bottlenecks to humanoid mass production source.

· Falling Robot Prices: Consumer-grade humanoid prices have dropped to around RMB 9,998; the main driver of cost reduction is Chinese manufacturers leveraging automotive-grade motors, reducers, and cameras from existing supply chains source.

· Cost Baseline: McKinsey estimates the current per-unit cost of humanoid prototypes at approximately $150,000–$500,000 and identifies supply-chain cost reduction as the decisive factor in bridging the commercialization gap source (WeChat, CN).

· Automotive Technology Reuse: Industry estimates suggest approximately 70% of the technology in building cars and humanoid robots (high-performance drive motors, reducers, sensors) is shared — a key reason automakers are entering the embodied AI space in force source.

FutureX · Physical AI Daily — Issue 29 (06/16)

Shawn — Mon, 15 Jun 2026 14:47:08 +0000

Today's Highlights

· Marine embodied intelligence becomes a new capital frontier: Shihang Intelligent closes an A-round exceeding 1 billion RMB, setting a global single-round funding record for marine robotics, with Zhu Xiaohu making his fifth consecutive investment and Temasek among new backers.

· World models continue to attract major funding: SenseTime-affiliated Daxiao Robotics accumulates hundreds of millions of dollars in the first half of the year and releases the home world model Kairos-HomeWorld, while GigaAI secures another 3.5 billion RMB over three months.

· Zhiyuan Yuanzheng A3 claims "fully autonomous" table tennis play against humans — no remote control, no scripting, no human intervention — a high-difficulty dynamic closed-loop capability demonstration (vendor claim).

· Humanoid robots debut across Chinese industry: SERES humanoid "Xiao Sai" makes its first appearance at a super factory; Songyan Dynamics releases its first open-source HarmonyOS consumer humanoid N2; Huawei puts humanoids on the HarmonyOS ecosystem.

· On the research side, "world model / world-action model" papers cluster: μ₀ replaces pixel prediction with 3D trajectory forecasting; Tencent Robotics X open-sources the full-stack VLA HyVLA-0.5.

I. Research Papers

μ₀: A Scalable 3D Interactive Trajectory World Model · world-model

Current world models follow two main paths, each with inefficiencies: pixel-space video models spend compute on dense appearance reconstruction, while direct action models require embodiment-specific action labels and are hard to scale. μ₀ offers a third route — predicting only the motion of the few points where interaction will occur.

Seungjae Lee et al. · arXiv 2606.13769 source · Commentary: SourceMind source (WeChat, CN)

μ₀ predicts neither dense pixels nor direct actions. Instead, it forecasts smooth 3D trajectories for key interaction points — objects, tools, hands, contact regions — forming a compact, embodiment-agnostic motion interface. The accompanying TraceExtract system automatically selects keypoints from diverse video sources and constructs 3D supervision signals, enabling training on heterogeneous video without action labels before transferring to specific robots.

Hunyuan Hy-Embodied-0.5-VLA: A Full-Stack System from VLA Model to Real-Robot Learning · vla

Rather than another benchmark-chasing VLA, this is Tencent Robotics X open-sourcing the entire pipeline — data collection, model, pretraining/fine-tuning, RL post-training, and real-robot deployment — making its engineering value greater than any single metric.

He Zhang et al. (Tencent Robotics X) · arXiv 2606.14409 source · Commentary: Jiqizhixin source (WeChat, CN) · HF↑6

The paper covers every stage of the full robot learning stack. On the data side, it uses sub-millimeter fingertip UMI interfaces for collection, eliminating heavy leader-follower teleoperation. On the post-training side, it is the first to systematically introduce Proximalized Preference Optimization (PRO) into flow-matching-based VLA reinforcement training, directly leveraging real robot failure data, and claims near-100% success rates on real-robot tasks. The model and methods are open-sourced.

EQRL: Elastic Execution Scheduling for VLAs Based on Task Difficulty · vla

Existing VLAs apply fixed denoising steps and replanning cadences regardless of whether the current state involves free-space translation or contact alignment — spreading compute evenly across states of unequal difficulty. EQRL makes "how long to compute" a learnable decision.

Ge Wang et al. (Ising AI & CUHK-Shenzhen) · arXiv 2606.14375 source · Commentary: Embodied Intelligence Chat source (WeChat, CN)

EQRL uses a lightweight latent-schedule adapter to jointly select latent inputs, denoising budgets, and action chunk lengths without fine-tuning the underlying VLA. A trained critic gives the scheduler difficulty awareness — hard or contact-dense states get more compute and more frequent feedback; easy states use less inference and longer open-loop execution. The commentary reports an approximately 32% reduction in inference cost.
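
To make the idea of difficulty-aware scheduling concrete, here is a minimal sketch that maps a critic's difficulty score to a denoising budget and an action chunk length. It uses a hand-written linear rule, whereas the paper learns the schedule with an adapter; the `Schedule` type, the `schedule_for` function, and the step and chunk ranges are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    denoise_steps: int  # inference compute spent on this action chunk
    chunk_length: int   # actions executed open-loop before replanning


def schedule_for(difficulty: float,
                 min_steps: int = 2, max_steps: int = 10,
                 min_chunk: int = 4, max_chunk: int = 16) -> Schedule:
    """Map a critic difficulty score in [0, 1] to a compute budget.

    Harder states get more denoising steps and shorter chunks (more frequent
    feedback); easier states get fewer steps and longer open-loop execution.
    """
    d = min(1.0, max(0.0, difficulty))  # clamp out-of-range critic outputs
    steps = round(min_steps + d * (max_steps - min_steps))
    chunk = round(max_chunk - d * (max_chunk - min_chunk))
    return Schedule(denoise_steps=steps, chunk_length=chunk)


# Hypothetical difficulty scores for three kinds of state.
for label, d in [("free-space move", 0.0), ("approach", 0.5), ("contact alignment", 1.0)]:
    print(label, schedule_for(d))
```

Because the underlying VLA is untouched, a rule like this (or its learned counterpart) only changes how often and how long the policy is queried.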

WAM4D: Fast 4D World-Action Model via Spatial Register Tokens · world-model

Most world-action models operate in 2D video or latent space. Predictions that "look plausible" lack 3D spatial constraints and the contact geometry of occluded regions, making them insufficient for precise manipulation. Yet forcing models to decode dense 4D geometry slows causal action generation. WAM4D aims to have it both ways.

Ying Li et al. · arXiv 2606.14048 source

WAM4D introduces lightweight "spatial register tokens" as training-time readout points for future depth, transferring 3D priors from pretrained geometric foundation models into a causal video-action model. This allows action predictions to carry 3D and contact geometry constraints without expensive dense geometric decoding, while maintaining fast inference.

ContactWorld: Key Ingredients for Visuo-Tactile World Models in Contact-Rich Manipulation · world-model

What representations actually support long-horizon planning in contact-rich tasks has lacked a systematic answer. This paper uses a new benchmark to empirically settle the question of which representation to choose.

Zhiyuan Zhang et al. · arXiv 2606.13877 source

The authors build a benchmark covering 12 categories of contact-rich tasks including insertion, disassembly, screw-tightening, and exploratory interaction, and systematically compare visuo-tactile world models. The conclusion: representations that are both "spatially structured" and "temporally continuous" plan most reliably. Point cloud observations raise average planning success rates significantly above wrist-camera views (20.7% and 22.0%), highlighting the value of structured geometric information for contact reasoning.

Output-Layer Regularization Eliminates the "Random-Seed Lottery" in Single-GPU VLA Fine-Tuning · vla

Same code, same data, only the random seed changes: run 13 times, 12 land stably at 91–94%, one quietly drops to 65.2% — a 29-percentage-point collapse with no errors and no warnings. This paper names the problem, localizes the cause, and offers a cheap fix that is highly practical for practitioners.

Jeffrin Sam, Dzmitry Tsetserukou (Skoltech) · arXiv 2606.13856 source

The authors call this phenomenon the "seed lottery" and trace its root cause to "output collapse": the action predictor learns to ignore its input and produce nearly identical actions. Weight-space methods such as L2 and EWC fail structurally here — they penalize weight changes, but collapse occurs along directions where weights barely shift. Switching to output-layer regularization eliminates the collapse.
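
A collapse of this kind can be checked for directly in output space. The sketch below measures how much predicted actions vary across distinct inputs and defines an illustrative penalty on that spread. The digest does not state the paper's exact regularizer, so `spread_penalty`, the thresholds, and the sample predictions are assumptions, not the authors' formulation.

```python
from statistics import pvariance
from typing import Sequence

Batch = Sequence[Sequence[float]]  # one predicted action vector per distinct input


def output_spread(predictions: Batch) -> float:
    """Mean per-dimension variance of predicted actions across the batch."""
    dims = list(zip(*predictions))  # transpose: one tuple per action dimension
    return sum(pvariance(d) for d in dims) / len(dims)


def is_collapsed(predictions: Batch, min_spread: float = 1e-3) -> bool:
    """Flag a predictor that emits nearly identical actions for different inputs."""
    return output_spread(predictions) < min_spread


def spread_penalty(predictions: Batch, target_spread: float = 0.05) -> float:
    """Illustrative output-space penalty: zero once the spread reaches the target."""
    return max(0.0, target_spread - output_spread(predictions))


# Made-up predictions for three different observations.
healthy = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.6]]
collapsed = [[0.2, 0.1], [0.2, 0.1], [0.2001, 0.1]]

print("healthy collapsed?  ", is_collapsed(healthy))
print("collapsed collapsed?", is_collapsed(collapsed))
```

Unlike an L2 or EWC term, a check like this reacts to what the predictor outputs, so it can fire even when the weights have barely moved.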

DiPOD: Preventing Diffusion Policy "Drift" During RL Post-Training · manipulation

RL post-training is increasingly critical for improving diffusion policies, but existing diffusion policy gradient methods are often unstable and unreliable. This paper from Berkeley identifies the mechanism behind the instability and provides a simple, practical fix.

Haozhe Jiang et al. (UC Berkeley) · arXiv 2606.13795 source

The authors identify "dual drift": optimizing the variational surrogate objective causes the ELBO to diverge from the true log-likelihood, which in turn causes the surrogate policy gradient to deviate from the true reward policy gradient. DiPOD alternates between self-distillation and policy improvement updates during training, equivalent to adding an on-policy ELBO regularization term to each diffusion policy gradient step, maintaining tight bounds and stable improvement throughout.

RT-VLA: Real-Time Autonomous Driving VLA via Knowledge Distillation · autonomy

VLA end-to-end joint modeling of perception, language reasoning, and action is promising, but the inference latency of large vision-language backbones makes real-world deployment impractical. This paper uses distillation to compress the capability into a model that can run in real time.

Xiangyu Huang et al. (CMU) · arXiv 2606.14010 source

RT-VLA uses multi-level supervised distillation to transfer the driving and reasoning capabilities of the state-of-the-art driving model SimLingo into a compact student model. Post-hoc language analysis of safety-critical moments is performed offline to preserve interpretability without adding real-time control latency.
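
As a generic illustration of distilling at more than one level, the sketch below adds an intermediate-feature term to an output-level term. Which levels RT-VLA actually supervises, and with what weights, is not stated here; the function names, the `feature_weight` value, and the assumption that student features are already projected to the teacher's dimension are all hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_actions: Sequence[float],
                      teacher_actions: Sequence[float],
                      student_features: Sequence[float],
                      teacher_features: Sequence[float],
                      feature_weight: float = 0.5) -> float:
    """Output-level term plus a weighted intermediate-feature term."""
    action_term = mse(student_actions, teacher_actions)
    feature_term = mse(student_features, teacher_features)
    return action_term + feature_weight * feature_term


# Made-up vectors: the student matches the teacher on the first entry only.
loss = distillation_loss(
    student_actions=[0.0, 1.0], teacher_actions=[0.0, 2.0],
    student_features=[1.0, 1.0], teacher_features=[1.0, 3.0],
)
print(f"distillation loss: {loss:.2f}")
```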

Other papers today: Multi-Agent Embodied Autonomous Driving (survey, unifying V2X cooperative driving under "shared world models," covering 380+ references); PhysVLA (physics-constraint plugin at inference time, wrapping any frozen VLA with <1ms per-step overhead); Spatially Conditioned Diffusion Policy (precise and robust manipulation from a single RGB camera); Universal Manipulation Exoskeleton (upper-limb teleoperation collection with real-time torque feedback); EgoGuide (first-person demonstration collection without a robot); GAIT (legged robot proprioceptive state estimation with inertial-leg token attention); Robust Fall Recovery (force-guided fall recovery for armless bipedal wheeled robots); Semidefinite Relaxations for Collision-Free Motion Planning (Russ Tedrake et al., theoretical analysis of semidefinite relaxations for collision-free motion planning); ReactVLA (lightweight low-latency reactive manipulation with improved Mean Flow).

Open source · tools · benchmarks: ORCA (open-source dexterous hand research full stack integrating control/simulation/teleoperation/retargeting into the robot learning ecosystem, arXiv 2606.14561 source); Kine2Go (Unitree Go2 multi-gait kinematics dataset, arXiv 2606.14433 source); ContactWorld (the visuo-tactile world model benchmark above, 12 categories of contact-rich tasks).

II. Funding & Deals

Shihang Intelligent | Series A | Over 1 billion RMB · embodied

A Suzhou-based marine embodied intelligence company with a fully in-house stack spanning six systems (propulsion, control, sensing, navigation, sealing, and deployment), focused on complex underwater environments. New investors include Moore Threads and Kunlun Chip-backed Shanghe Momentum Fund, Temasek's Vertex Growth, CITIC Group's agricultural industry fund, Yuzun Capital, and listed company Dayang Electric, with existing investor GGV Capital and others following on. Zhu Xiaohu has now invested in the company in five consecutive rounds. The company claims this is the largest single-round raise in global marine robotics, with first-half orders exceeding 1 billion RMB. The ocean is a "long tail of the physical world" that capital has largely overlooked; this round opens a new front beyond the crowded land-based humanoid space. Source: IPO Early Notice source (WeChat, CN)

Daxiao Robotics | Angel+ Round | Hundreds of millions USD (H1 cumulative) · world-model

An embodied intelligence company under SenseTime, led by co-founder Wang Xiaogang, positioned as a supplier of robot "brains" and intelligent core components. Its flagship product is the Kairos world model 3.0; in collaboration with CUHK and the Shenzhen River Loop Academy, it has released Kairos-HomeWorld, a world model framework for whole-home generation and full object interaction. This round brings in Ant Group, Geely Capital, Dachen Caizhi, Shenzhen Capital Group, Qiming Venture Partners, SenseTime Guoxiang, Moore Threads, and Lenovo Ventures, among other industrial and financial backers. Founded in July 2025, the company has raised hundreds of millions of dollars in cumulative H1 funding and is described as one of the embodied AI sector's "fastest unicorns." Source: ZhangTong Society source (WeChat, CN)

GigaAI | New Round | 1 billion RMB (approx. 3.5 billion RMB cumulative over three months) · world-model

Founder Huang Guan argues that "physical AGI will act directly on the real physical world"; the company enters through world models and video generation. Three consecutive funding rounds over three months total approximately 3.5 billion RMB, with top-tier global capital concentrating its bets, another data point showing that world model momentum is flowing from papers and conferences into the primary market. Source: Zhidongxi source

Xuanji Dynamics | Strategic Investment | Amount undisclosed · industrial ⚠️ Single-party claim

An embodied intelligence robotics company that has received strategic investment from SAIC Group, Dongfang Precision, and others. Following several automakers' cross-sector moves, OEM and industrial capital continue to stake positions in the embodied supply chain via equity stakes. The amount and ownership percentages have not been fully disclosed. Source: Guandian.cn source

Noitom Robotics | Pre-A++ Round | Amount undisclosed · adjacent

Positioned as "a robotics company that doesn't build robots," Noitom provides data infrastructure for embodied intelligence and humanoid robots; ModalityNet has gone live and the company continues advancing embodied data industrialization. It says its next round will open soon. Data collection and labeling are becoming the next capital-intensive layer after hardware. Source: Jinxiu Science Park source (WeChat, CN)

MW (Japan) | Seed Round | $21 million · adjacent

Japanese startup MW has closed a $21 million seed round to build "homes designed for physical AI," embedding robot accessibility, sensing, and interaction capabilities into residential spaces from the ground up. Modifying the home before deploying the robot is an uncommon but pragmatic approach to the household scenario. Source: AI Insider source

III. Commercial Deployment

Middle East's First Hydrogen-Powered Autonomous Heavy Truck Enters Operation · autonomy

A hydrogen-powered autonomous heavy truck developed with the participation of Refire Energy has entered actual operation in the Middle East, combining autonomous driving and hydrogen propulsion for highway and port freight. The Middle East has recently become a significant destination for Chinese autonomous heavy trucks and Robotaxi services expanding internationally. Source: China Hydrogen Energy Industry Promotion Association source

Chinese Greenhouse Tomato Harvesting Robot Enters European Trial · embodied

A Chinese greenhouse tomato harvesting robot has secured a European trial, advancing visual grasping from laboratory demonstration to field validation in a real cultivation environment. Agricultural harvesting is one of the few manipulation robot segments with meaningful existing willingness to pay at scale; an overseas trial is a key step toward commercialization. Source: Hortidaily source

Marine AI Welding Robot Completes 30-Tonne Component Operations · industrial

Xinhua reports that an AI robot in China completed autonomous welding of 30-tonne large components in an offshore engineering project, supporting a range of inshore-to-offshore underwater and surface operations. This echoes today's Shihang Intelligent funding news: marine manufacturing and marine embodied intelligence are simultaneously attracting industrial and capital attention. Source: Xinhua source

JD.com Partners with Two Platforms in One Month, Doubling Down on Robot Leasing · adjacent

JD.com has partnered with two robot platforms within a single month, expanding its robot leasing business. At a stage when hardware prices remain high and corporate procurement is cautious, leasing and as-a-service models are a key commercial lever for lowering deployment barriers and accelerating real-world adoption at scale. Source: Sohu source

IV. Industry Developments

Zhiyuan Yuanzheng A3 Claims "Fully Autonomous" Table Tennis Against Humans · embodied ⚠️ Vendor claim

Zhiyuan Robotics says its full-size bipedal humanoid Yuanzheng A3 has played table tennis against a human with no remote control, no scripting, and no human intervention, self-reporting a hit rate of approximately 91% and claiming to be "the world's first" full-size bipedal humanoid to achieve this. Table tennis demands near-human real-time performance across the perception-decision-execution loop, making it a meaningful demonstration of capability in dynamic unstructured environments. However, "world's first" and "91%" are unverified vendor claims from a single demonstration, and demonstration capability is separate from production readiness or scalable deployment. Sony's earlier AI table tennis robot serves as a comparable prior example. Source: Jiemian News source (WeChat, CN)

SERES In-House Humanoid "Xiao Sai" Debuts at Super Factory · humanoid

SERES Group Vice President Kang Bo released a video unveiling the company's in-house humanoid robot "Xiao Sai," which performed visual recognition and voice interaction as a tour guide inside a super factory (accompanying actor Huang Bo on a visit). The company also disclosed that multiple logistics and quality-inspection robots for factory scenarios are already deployed on production lines, with plans to release additional bipedal, quadruped, and other embodied robots later this year. Automakers' existing supply chains and ready-made deployment venues give them a natural advantage in entering the humanoid space; SERES's entry further enlarges the field of cross-sector automotive players. Source: Embodied Intelligence HQ source (WeChat, CN)

Songyan Dynamics Releases First Open-Source HarmonyOS Consumer Humanoid N2, Launches "100 People, 100 Robots" Program · humanoid

Songyan Dynamics has released N2, described as the industry's first consumer humanoid robot integrated with open-source HarmonyOS, alongside a "100 People, 100 Robots" developer co-creation program that selects 100 developers to receive free robots. Open-source HarmonyOS integration means voice interaction can connect with air conditioners and other smart devices across multiple platforms. Pairing a consumer humanoid with an open operating system ecosystem is a move to lower development barriers and compete for developer mindshare. Source: Phoenix Online source

Huawei HDC2026: Humanoid Robots Run on HarmonyOS, Cross-Platform Device Integration Demonstrated · embodied

At Huawei Developer Conference 2026, robots connected to open-source HarmonyOS, including robot dogs and humanoids, appeared on stage, demonstrating the ability to use robots as entry points that control home and office devices. Huawei's strategy is to hold the "foundation layer" of embodied intelligence through its operating system and ecosystem position, rather than building complete robots itself. Source: Sina Finance source

Li Auto Livis Day: Mach VLA Targets Tesla FSD V14 Parity in Q4 · autonomy ⚠️ Vendor claim

Li Auto held its Livis Day software and embodied intelligence launch event, announcing comprehensive upgrades across software and embodied AI. The head of autonomous driving said the in-house Mach VLA continues to evolve, with a stated goal of matching Tesla FSD V14 in Q4. "Matching FSD V14" is a public self-benchmark; no third-party evaluation on the same criteria exists, and real-world performance remains to be validated by Q4 vehicle testing. Source: Sina Finance source

Parallel Systems Advances "World's First Autonomous Freight Rail System" · autonomy

Parallel Systems, founded by former SpaceX engineer Matt Soule, is building electric autonomous freight cars with no locomotive and no on-board driver, compatible with existing freight and train control software and capable of autonomous coupling and decoupling. The company has raised over $100 million in total, holds 300+ vehicle orders, and has received Federal Railroad Administration (FRA) approval to conduct the first autonomous freight rail system test, targeting initial commercial operations in 2026. Shifting short-haul freight from road to automated rail is a distinct route in freight autonomy, separate from the trucking path. Source: Robotics & Automation News source

"DeepMind Partner" Dishwashing Humanoid Video Admitted by Founder to Be Fake · humanoid

A video appearing to show a humanoid robot doing kitchen chores went viral before being confirmed as an AI-generated promotional film. Fabian Kerj, founder of Qualia, which is part of Google DeepMind's European robotics partnership program, admitted it was not a real robot, saying "we build training infrastructure, not hardware," and adding "but it got your attention." The incident once again puts the authenticity of humanoid robot demonstrations and marketing honesty under scrutiny; earlier cases of humans posing as robots were widely cited in response. Source: The Cool Down source

Yinghe Robotics Reported Near Collapse; Panda Capital Applies Public Pressure · industrial ⚠️ Single-party claim

Yinghe Robotics, which has raised approximately 600 million RMB, is reported to be in serious operational difficulty, with investor Panda Capital publicly directing blame at parties affiliated with Midea. Specific operational details and the dispute are based on a single party's account and await responses from other parties. Even as primary-market capital floods in, exit and governance risks for early-stage projects are beginning to surface: the other side of a sector experiencing simultaneous boom and stress. Source: Sina Finance source

Hardware & supply chain: Competition to mass-produce dexterous hands is intensifying — Yinshi Robotics claims it delivered over 10,000 dexterous hands in 2025, calling itself the first company globally to exceed 10,000 annual units with over 60% market share (vendor claim); LinkerBot says its single-month peak output exceeds 4,000 units across tendon-driven, linkage, and direct-drive product lines; ICRA 2026 saw a surge of new dexterous hand entrants and products, with component supply becoming more predictable than the hardware body itself. Upstream precision reducer substitution in China is entering a "golden window" — performance approaching Japanese leaders at significantly lower cost — widely viewed as one of the key bottlenecks to cost reduction at humanoid scale.

FutureX · Physical AI Daily — Issue 28 (06/15)

Shawn — Sun, 14 Jun 2026 14:48:31 +0000

Today's Highlights

· XPENG issued an internal letter with He Xiaopeng personally taking over as CEO of the robotics division, announcing the company's transformation from an intelligent vehicle company into a "Physical AI company." The IRON humanoid robot is set for mass production by end of 2026, entering XPENG showrooms as a sales guide in Q1 2027, with full-stack in-house R&D maintained from chips to articulated dexterous hands.

· "Nurturing the brain" becomes capital's new focus: Hillhouse Ventures exclusively backs MoYa (Hongfire Intelligence / SoulX) — a soft companion robot entering through the sleep scenario — at the angel round; Jianzhou Robotics secures hundreds of millions of RMB led by Ant Group, DiDi, and Delian Capital, setting a new funding record for the "embodied data without robot hardware" segment.

· Hong Kong IPO pipeline expands simultaneously: EngineAI reportedly filed confidentially with HKEX at a valuation exceeding CNY 10 billion; highway-logistics autonomous driving firm Zhuganxian Technology re-filed; harmonic reducer manufacturer LAIFUAL Drive passed its listing hearing.

· University of Maryland's HumanEgo used only 30 minutes of human first-person video and zero robot data to achieve a 92.5% success rate across four real-world tasks on a bimanual robot, with zero-shot transfer to different embodiments, cameras, and scenes.

· Deployment milestone: Junpu Intelligent rolled off the first batch of G2 robots, equipped with the GO-1 embodied foundation model, completing 2,283 tasks over 8 consecutive hours in a 3C factory with zero errors; a thousand-unit order has been locked in.

I. Research Papers

HumanEgo: 30 Minutes of Human Egocentric Video, Zero-Shot Teaching of Bimanual Robots · manipulation

This work moves the robot's "data interface" from lab teleoperation to a pair of smart glasses — relying on no robot data, no robot post-training, and no internet-scale pretraining, learning deployable policies directly from minutes of human video. It is the most radical step yet along the route of learning manipulation from human video.

Zhi (Leo) Wang et al. (University of Maryland) · arXiv 2605.24934 source · Commentary: Jiqizhixin source

The core idea is "grounding representations in interaction rather than in the body": Meta Aria glasses capture egocentric video with 6-DoF head trajectories and 3D hand keypoints; for each hand and each object in the scene, a 29-dimensional Interaction-Centric Token (ICT) is computed, encoding the entity's 6D pose in a reference frame along with hand-relative-to-object pose and grasp state. The human hand is retargeted and abstracted as a "virtual two-finger gripper," and visual processing that masks out the human arm and renders a virtual gripper eliminates the appearance and kinematic gap between human hands and robot end-effectors; during occlusion, kinematic locking maintains object pose continuity. The policy uses flow matching with three dense auxiliary objectives — object motion prediction, 2D trajectory regression, and latent consistency — enabling efficient learning from as few as ~60 trajectories. Across four task categories — pick-and-place, long-horizon stacking, contact-rich bimanual coordination, and continuous rotation — the method achieves a 92.5% success rate, outperforming an equal-duration teleoperation baseline by 41 percentage points.
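
To make the flow-matching objective mentioned above concrete, here is a minimal sketch of how a single training target can be formed. It assumes the common linear-path (rectified-flow-style) formulation; the digest does not say which variant HumanEgo uses, and the three auxiliary objectives are omitted. The function names and the toy two-dimensional "action" are illustrative only.

```python
def flow_matching_target(x0, x1, t):
    """Linear-path flow matching (an assumed variant, not confirmed by the source).

    x0: noise sample, x1: demonstrated action, t: time in [0, 1].
    Returns the interpolated point x_t and the velocity the network should predict.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]  # constant velocity along a straight path
    return x_t, v_target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Toy example: a 2-D "action" and a zero noise sample (hypothetical values).
noise = [0.0, 0.0]
action = [2.0, 4.0]
x_t, v_target = flow_matching_target(noise, action, t=0.25)
print(x_t, v_target)
```

In a real policy the velocity would be predicted by a network conditioned on the observation tokens, and the loss would be averaged over random draws of `t` and noise.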

Multilingual Instructions Expose "Stepped" Fatal Weaknesses in VLA Models · vla

Mainstream VLA multimodal backbones are theoretically cross-lingual, but this has never been systematically verified. This paper reframes "language robustness" from a static model capability into a dynamic temporal control problem during execution — and proposes a repair method that intervenes only at inference time without retraining.

Harbin Institute of Technology · ACL 2026 Main Conference Accepted · Commentary: Roushen Algorithm source

The team translated LIBERO instructions into 10 languages including Chinese, Japanese, and Arabic to build the LIBERO-Multilingual benchmark. OpenVLA-OFT achieves an average success rate of 97.1% in English, plummeting to 50.8%–65.3% in non-English languages; on Goal tasks, Arabic drops to just 6.4%, more than 91 percentage points below English. Representation bias and text-image gradient ratio analysis revealed that language influence is not uniformly distributed but concentrated at a few "critical nodes" — 53% of multilingual failures cluster in the "navigation" phase, which requires language to localize targets. The authors propose a step-wise inference-time intervention: offline identification of the gradient-sensitive top-50% of steps, followed by online alignment of representations toward the English reference direction at those steps. OpenVLA-OFT non-English average improves by 9.5pp; pi0.5 recovers by a substantial 24.2pp (Chinese: 56.7%→80.3%); applying a uniform mean shift or selecting steps randomly is nearly ineffective or even counterproductive.
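
The step-wise intervention described above can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: the sensitivity scores, the per-step English reference vectors, and the linear interpolation used as the "alignment" rule (with strength `alpha`) are all assumptions made for the example.

```python
def select_sensitive_steps(sensitivity, fraction=0.5):
    """Offline stage: pick the indices of the top `fraction` of steps by score."""
    k = max(1, int(len(sensitivity) * fraction))
    ranked = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i], reverse=True)
    return set(ranked[:k])


def align_step(hidden, english_ref, alpha):
    """Hypothetical alignment rule: move `hidden` a fraction `alpha` toward the reference."""
    return [h + alpha * (r - h) for h, r in zip(hidden, english_ref)]


def intervene(hidden_by_step, english_ref_by_step, sensitivity, fraction=0.5, alpha=0.5):
    """Online stage: adjust representations only at the selected steps."""
    chosen = select_sensitive_steps(sensitivity, fraction)
    return [
        align_step(h, r, alpha) if t in chosen else list(h)
        for t, (h, r) in enumerate(zip(hidden_by_step, english_ref_by_step))
    ]


# Toy episode with four steps and 1-D representations (hypothetical values).
scores = [0.1, 0.9, 0.3, 0.8]
hidden = [[0.0], [0.0], [0.0], [0.0]]
reference = [[2.0], [2.0], [2.0], [2.0]]
print(intervene(hidden, reference, scores))
```

Replacing `select_sensitive_steps` with a random choice, or shifting every step by the same amount, corresponds to the two baselines the summary reports as nearly ineffective.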

GaussianDWM: Unifying Scene Understanding and Multimodal Generation for Autonomous Driving with 3D Gaussians · world-model

Most driving world models focus on "predicting/generating future frames" but cannot answer what objects are in the scene or where they are. This paper feeds the same 3D Gaussian representation into an LLM for understanding and uses it as a condition to drive generation, aiming to supply the explicit 3D structure that world models have been missing.

Tianchen Deng et al. (SJTU / Tsinghua / Megvii / Mach Drive) · CVPR 2026 · Code: github.com/dtc111111/GaussianDWM · Commentary: Jiqizhixin source

The method comprises three parts — World Tokenizer, scene understanding, and multimodal generation — all organized around the same set of 3D Gaussians: language features derived from CLIP and inheriting SAM's hierarchical semantics are layered onto the Gaussian primitives (a scene-level autoencoder compresses 512 dimensions to 3), then projected into the LLM embedding space via a Gaussian Projector with task-aware sampling (4,096 Gaussian tokens in the main experiment). On the generation side, a dual-conditioning design uses low-level RGB/depth to constrain texture and geometry while high-level world knowledge from the LLM supplies semantic spatial priors. On NuInteract, the average metric reaches 59.23 (vs. DriveMonkey's 52.12), with 2D/3D visual grounding mAP improving from 19.47/34.53 to 34.95/52.78.

SANA-WM: An Efficient World Model Deployable in Minutes on a Single GPU · world-model

Long-video world models typically require large models, large datasets, and multi-GPU inference — cost is the key obstacle to deploying them in embodied simulation. This work compresses "60-second 720p, camera-controllable" generation onto a single GPU, offering a cost-reduction path for scalable world model research.

Zhu Haoyi et al. (NVIDIA / University of Science and Technology of China) · arXiv 2605.15178 source · Code: NVlabs/Sana · Commentary: Lumina Embodied Intelligence source

The model generates 60-second, 720p, camera-motion-controllable video worlds from a first frame image, text, and a 6-DoF camera trajectory. The architecture uses a Hybrid Linear DiT combining Gated DeltaNet with softmax attention, maintaining long-context modeling while reducing compute and memory; a dual-branch camera controller with UCPE and Plücker ray conditioning improves trajectory-following accuracy. The team constructed a video dataset of approximately 213,000 clips with metric-scale camera pose annotations and applied a two-stage generation process with a long-video refiner to improve visual quality and temporal consistency.
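
For readers unfamiliar with Plücker ray conditioning, the sketch below computes the standard six-number Plücker representation of a camera ray: the unit direction plus the moment (origin cross direction). How SANA-WM orders, normalizes, or injects these values is not stated in the summary, so the layout here is an assumption.

```python
import math


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d, as one 6-tuple."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    dx, dy, dz = (c / norm for c in direction)
    ox, oy, oz = origin
    moment = (
        oy * dz - oz * dy,
        oz * dx - ox * dz,
        ox * dy - oy * dx,
    )
    return (dx, dy, dz) + moment


# A ray through (1, 0, 0) pointing along +y (hypothetical camera ray).
print(plucker_ray((1.0, 0.0, 0.0), (0.0, 2.0, 0.0)))
```

A useful property of this encoding is that any point along the same ray yields the same six numbers, so it describes the ray itself rather than a particular origin on it.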

LabVLA: Enabling Robots to "Run Experiments" in the Lab · vla

Scientific AI has long suffered a "brain-hand disconnect" — capable of reading literature and designing protocols, but unable to perform pipetting, centrifugation, and other physical tasks that consume 60% of a researcher's time. This is a full-stack solution combining a data engine, training recipe, and evaluation benchmark tailored to lab scenarios, not just a model architecture change.

Zhejiang University / Shanghai AI Lab et al. · Commentary: Roushen Algorithm source

Addressing the unique challenges of high-precision lab instruments, transparent liquids, and zero-tolerance protocol workflows, the team uses the simulation data engine RoboGenesis to overcome real-data scarcity: atomic skills are defined and composed into workflows, filtered through physics validation (checking for liquid spills and protocol compliance), then structured for export across multiple robot embodiments. The model undergoes two-stage training on Qwen3-VL-4B — first pretraining with FAST action tokens to familiarize the backbone with actions, then post-training with flow matching and a DiT action expert for continuous control output, with "knowledge isolation" freezing backbone weights to preserve existing visual-language reasoning. The LabUtopia benchmark for lab tasks is released alongside.
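
The "compose skills, then filter through physics validation" stage can be pictured with a small sketch. Everything here is hypothetical: the `Rollout` record, the skill names, and the two checks are stand-ins for whatever RoboGenesis actually measures, which the summary does not detail.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    skills: list        # ordered atomic skill names executed in simulation
    spilled_ml: float   # simulated liquid lost during the rollout


def is_valid(rollout, protocol, spill_tolerance_ml=0.0):
    """Keep a rollout only if no liquid was spilled and the steps match the protocol order."""
    if rollout.spilled_ml > spill_tolerance_ml:
        return False
    return rollout.skills == list(protocol)


def filter_rollouts(rollouts, protocol):
    """Return the rollouts that pass both checks, ready for export as training data."""
    return [r for r in rollouts if is_valid(r, protocol)]


# Hypothetical protocol and simulated rollouts.
protocol = ["open_tube", "pipette", "close_tube", "centrifuge"]
rollouts = [
    Rollout(["open_tube", "pipette", "close_tube", "centrifuge"], 0.0),
    Rollout(["open_tube", "pipette", "close_tube", "centrifuge"], 0.3),  # spill
    Rollout(["pipette", "open_tube", "close_tube", "centrifuge"], 0.0),  # wrong order
]
print(len(filter_rollouts(rollouts, protocol)))
```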

Other papers today: the asset-conversion pathway from "test data → world model training data" pioneered by Qingyan Precision and others has drawn wide discussion (W65); this week saw a concentrated output of 8 notable works in the humanoid Loco-Manip direction, covering whole-body control and mobile manipulation (W15).

Open-source · Tools · Benchmarks: HumanEgo open-sourced its code and accumulated 230+ stars within days (humanego-ai.github.io); SANA-WM released alongside NVlabs/Sana has gained 2.5k+ net new stars since launch; GaussianDWM, LabVLA's LabUtopia benchmark, and the RoboGenesis data engine were all released simultaneously; Zhiyuan opened the AGIBOT WORLD dataset and Genie Sim 3.0 simulation platform at the BAAI Conference.

II. Funding & Transactions

Jianzhou Robotics ｜ Multiple Consecutive Rounds ｜ Cumulative Hundreds of Millions RMB · adjacent

This round was co-led by Ant Group, DiDi, and Delian Capital, with returning investors Shunwei Capital, BV Baidu Ventures, and Jiushi Intelligence adding follow-on. It marks the first joint embodied-AI investment by Ant and DiDi and is the largest single funding to date in the "embodied data without robot hardware" segment. The company was founded in May 2025 by former Momenta senior algorithm director Chen Jianxing and senior intelligent driving product expert Zhu Yanming, who argue "data will achieve scale-up earlier than models." The company developed the Gen DAS passive wearable data-collection device in-house and launched Gen EgoData, a full-modality dataset for embodied world models encompassing vision, force, action trajectories, physical interaction outcomes, and chain-of-thought. With more than 30 AI companies as partners, and alongside JD's self-built collection centers and four robot makers co-investing in Zhiyu Cornerstone, the segment is shifting from "building bodies" to a data arms race for "nurturing brains." Source: Data Walker X source

Hongfire Intelligence (SoulX / MoYa) ｜ Angel Round ｜ Hillhouse Ventures Sole Investor · adjacent

This is the company's first external funding round, earmarked for R&D, mass-production delivery, and supply chain buildout of the MoYa soft family-care robot. MoYa enters through the sleep scenario: it resembles a plush toy designed to be hugged while sleeping, integrating soft structure, pneumatic actuation (air bladders inflating and deflating to simulate an embrace), breathing rhythms, and emotional companionship, deliberately scoped to "hugging, breathing, and gentle patting" with no vision module for now, to protect privacy. Founder Zheng Qian holds a robotics undergraduate degree from HIT and a PhD from Zhejiang University, with early-stage experience at an exoskeleton company. In contrast to the mainstream "general humanoid + industrial" narrative, MoYa pursues a differentiated niche-scenario approach and plans to launch in September this year. Source: Hillhouse Ventures source

Poke Robotics ｜ Angel Round ｜ Tens of Millions USD · embodied

Poke Robotics was founded by Xu Huazhe after he left Xinghaitu, itself valued at over CNY 20 billion; the new company has now completed this angel round. Capital continues to favor the robotics segment; the question raised by Caixin is whether this reflects validated business models or companies stockpiling "ammunition" ahead of intensifying competition. Source: Caixin source

EngineAI ｜ Proposed HK IPO ｜ Valuation Exceeds CNY 10 Billion · humanoid ⚠️ Reported

The company reportedly filed confidentially with HKEX, working with CICC and CITIC Securities. Combined with Unitree's push toward the STAR Market, integrated humanoid manufacturers are collectively moving toward public markets. Source: Sohu source

Zhuganxian Technology ｜ Re-filed with HKEX · autonomy

Ranked fourth among China's commercial vehicle autonomous driving solution providers, focused on heavy-truck driverless operation in highway logistics and port scenarios. A prior filing did not proceed; this is the company's renewed attempt. Source: ifeng.com source

LAIFUAL Drive ｜ Passed HKEX Listing Hearing · hardware

Harmonic reducers are a core component of humanoid robot joints. The company posts annual revenue of CNY 260 million while still losing CNY 170 million, with Lenovo and China Development Bank Fund as shareholders. It is pushing into the capital markets on expectations of domestic substitution in the humanoid supply chain, though profitability remains a question mark. Source: Sohu source

ULTIROBOTICS ｜ RMB Tens of Millions · industrial

A warehouse embodied AI company, jointly invested by Changshu ETDZ Juyuan and Shenzhen Institute of Science and Technology Innovation, with Houlang Capital serving as strategic financial adviser. Its in-house Ulti-Brain model uses a hierarchical architecture integrating a world model, performing long-horizon continuous spatial perception from RGBD streams, with a focus on generalization that does not depend on customer-specific scene data. Source: Gaogong Humanoid Robots source

III. Commercialization & Deployment

Junpu Intelligent Rolls Off First G2 Robots, Putting Them to Work in 3C Factories · industrial

The G2 is equipped with the GO-1 embodied foundation model and completed 2,283 tasks over 8 consecutive hours in a 3C factory with zero errors; a thousand-unit order has been locked in. Four production lines at the Wuxi base roll off approximately one unit per hour, with a monthly capacity target of 300–400 units in August and plans to cut manufacturing costs by 20% within two years. Unlike the "general humanoid" approach, Junpu entered through deterministic scenarios — 3C electronics assembly, logistics sorting, and industrial handling — first. With the industry still largely at the demonstration stage, continuous zero-error data from a real production line is more compelling than lab footage. Source: Zhidian Chaijie source
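
As a quick back-of-the-envelope reading of the factory figures above, the reported 2,283 tasks in 8 hours imply the following average throughput. This only restates the quoted numbers; it assumes continuous operation with no idle time, which the source does not confirm.

```python
tasks = 2283   # tasks completed, as reported
hours = 8      # consecutive hours of operation, as reported

tasks_per_hour = tasks / hours
seconds_per_task = hours * 3600 / tasks

print(f"{tasks_per_hour:.1f} tasks/hour, about {seconds_per_task:.1f} s per task")
```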

Pony.ai Seventh-Generation Robotaxi Debuts at Chongqing Auto Show, Cost Falls Below CNY 230,000 · autonomy

The seventh-generation vehicle brings total vehicle cost below CNY 230,000, a critical step toward a viable unit economics model for scaled operations, as the company simultaneously signals accelerated overseas expansion. The cost curve is widely regarded as the decisive factor in the Robotaxi race, and this reduction aims to narrow the per-unit economic gap with the single-vehicle intelligence approach. Source: D1EV source

Waymo Acquires Apple's Former Arizona Autonomous Driving Test Site for $220 Million · autonomy

Waymo acquired Apple's former autonomous driving test facility in Arizona for approximately $220 million. Against the backdrop of its expansion into new cities such as Nashville and the development of its sixth-generation system, building proprietary testing and operational infrastructure has become a necessary companion to fleet scale-up. Source: MSN source

Domestic Robots Enter BMW's Shenyang Production Line for Validation · industrial ⚠️ Validation Stage

BMW's production line in the old industrial base of northeast China has introduced domestic humanoid robots for production floor trials. The reporting also candidly notes that repeated validation on a live line is still required — whether robots can reliably identify parts and pick up tools, and whether they can maintain safe distances in human-robot collaboration while sustaining prolonged high-intensity operation. This is an automotive plant pilot, not yet a mass-production milestone. Source: Sina Finance source

UBTECH U1 Ultra-Bionic Companion Robot Approaches 4,000 Pre-orders · humanoid

JD.com pre-orders over 10 days are approaching 4,000 units (up from a previously reported 2,700 in 6 days, and still rising), with a CNY 3,000 deposit targeting adult consumers. The robot features a "companion-raising" emotional AI model and is positioned as a family companion rather than a productivity tool. Note that this demand is primarily consumer novelty-seeking and is separate from industrial mass-production capability. Source: Pandaily source

Amazon Expands Warehouse Automation in India · industrial

Amazon announced further expansion of warehouse robotics and automation deployment in India, continuing the robotization of its global fulfillment network and signaling an acceleration of warehouse automation in emerging markets. Source: Tech in Asia source

IV. Industry Developments

XPENG: He Xiaopeng Takes Personal Command of Robotics, Announces Transformation into "Physical AI Company" · humanoid

The June 10 internal letter carries significant weight — "XPENG Robotics officially enters the eve of mass production and commercialization" and "this marks XPENG's transformation from an intelligent vehicle company to a Physical AI company." He Xiaopeng takes on the role of CEO of the robotics division, pulling group resources to fully replicate the automotive business's supply chain, manufacturing, and quality systems into robotics. The timeline calls for the IRON humanoid robot to enter mass production by end of 2026 and appear in XPENG showrooms as a sales guide in Q1 2027, with full-stack in-house R&D maintained from chips and OS to joints and dexterous hands — an analogy to BYD's vertical integration in battery manufacturing to compress costs. A risk factor: the core robotics team just went through departures in late May, making a 200-day sprint to mass production more challenging. This is another automaker — following BYD's move into humanoids — making a large-scale transfer of in-house vehicle capabilities to the embodied AI segment. Source: Zhidian Chaijie source

Zhiyuan Releases Embodied Foundation Model GO-2, Leading with "Action Chain-of-Thought" · world-model ⚠️ Vendor Claims

At this week's BAAI Conference, Zhiyuan unveiled GO-2, the next-generation embodied foundation model. Its core is ACoT-VLA (Action Chain-of-Thought): conventional VLAs must "observe scene → generate language description → map to action," with the intermediate language-translation step introducing information loss; GO-2 enables reasoning to occur directly in action space, outputting structured, kinematically feasible coarse-grained action intent sequences via two action reasoners (explicit and implicit), with asynchronous coarse-fine execution for real-time correction and a full-lifecycle closed loop for continuous self-optimization after deployment. Zhiyuan claims a 25% improvement in long-horizon task success rate over the baseline (vendor-stated figure, awaiting third-party replication). Also unveiled: the Elf G2 industrial robot, Genie Sim 3.0 simulation platform, AGIBOT WORLD open-source dataset, an ICRA million-dollar competition, and the "Yuansheng" ecosystem plan with a five-year CNY 2 billion commitment — presenting a combined foundation model + simulation + open-source data + developer ecosystem strategy. Source: Guijiyizhi source

Kunlun Tech Unveils World Model Matrix-Game 3.5 · world-model

At the BAAI Conference, Kunlun Tech's Tiangong team unveiled the latest progress on world model Matrix-Game 3.5. World models have become one of the most discussed topics at this year's conference, with researchers from embodied AI, robot control, game engines, and physical AI infrastructure each presenting their technical approaches; the debate over competing paradigms continues. Source: Kunlun Tech Group source

GM CEO: Prioritizing Consumer Vehicle Autonomy Now, Laying Groundwork for Ride-Hailing Later · autonomy ⚠️ Statement

Following the shutdown of Cruise, General Motors is shifting focus to embedding autonomous driving as a consumer vehicle feature, then using that as a foundation for future mobility services — effectively staying in the autonomous driving space via "vehicle-side autonomous capability" rather than an independent Robotaxi fleet. This contrasts with the approach of most Chinese players who are deploying Robotaxi fleets directly. Source: DoNews source

Humanoid Robots to Get "ID Cards": Full Lifecycle Management Standard Issued · humanoid

According to the Humanoid Robot and Embodied Intelligence Standardization Technical Committee under the Ministry of Industry and Information Technology, under the newly issued "Humanoid Robot Full Lifecycle Management Specification," every humanoid robot will be assigned an identity code (its "ID card"); approximately 28,000 units have already received codes. This is a foundational infrastructure step as the industry moves from "building them" to "managing them traceably." Source: Electronic and Electrical Metrology and Testing source

Hardware · Supply Chain: Linkerbot claims to be the world's only company achieving mass production of high-DoF dexterous hands at the ten-thousand-unit scale, holding approximately 80% of the global high-DoF dexterous hand market, with monthly capacity ramping from over 4,000 units toward the ten-thousand-unit level (W92/W95, ⚠️ vendor claims); Hamm Electronics' 8mm micro-motors are now shipping in volume with capacity scaling up, addressing incremental demand for dexterous hands as "consumables" in data collection and teleoperation (W93); Unitree is pursuing a QDD quasi-direct-drive approach (large motor + small gearbox) to replace harmonic reducers and cut costs, with joint actuators accounting for approximately 50%–70% of the robot BOM (W41); engineering plastics — which can reduce density by 50%–70% compared to metal — are accelerating adoption in humanoid structural components (W42).