Evaluating speech synthesis in many languages with SQuId.
This post discusses how useful SQuId is for evaluating speech synthesis in many languages and why it has become a valuable tool for researchers.
SQuId (Speech Quality Identification) is a multilingual regression model that predicts how natural a piece of synthesized speech sounds, making it useful for evaluating text-to-speech systems such as Tacotron.
Evaluating speech synthesis in many languages with SQuId Wednesday, June 07, 2023 Posted by Thibault Sellam, Research Scientist, Google Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment invo...
Visual captions: Using large language models to augment video conferences with dynamic visuals
Video conferencing platforms are increasingly replacing in-person interactions, particularly in the time of the COVID-19 pandemic.
To maximize the effectiveness of communication in such environments, we augment video conferences with dynamic visuals that help participants comprehend the speaker’s ideas.
By training a large language model on a huge number of captioned videos, we design a system that automatically generates images or a video clip that corresponds to the speaker’s utterances.
The images or videos act as additional visual aids shown to participants and assist those who are having difficulty understanding what the speaker is saying.
To validate our system’s effectiveness, we conducted a user study in which study participants interacted with the system via their video conferences and reported the system’s usefulness.
Our results show that the system improves the participants' understanding of the conference and their satisfaction with it.
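To make the idea concrete, here is a minimal sketch of the inference step under the description above: a language model is queried to propose a visual for each utterance. The prompt wording and the `generate_text` hook are illustrative stand-ins, not the actual Visual Captions implementation.

```python
# Illustrative sketch only, not the Visual Captions codebase: query a language
# model for a visual suggestion given a spoken utterance. `generate_text` stands
# in for any text-generation API.

def suggest_visual(utterance: str, generate_text) -> str:
    """Ask a language model to propose a visual aid for a spoken utterance."""
    prompt = (
        "Suggest a single visual to show during a video call.\n"
        f'Utterance: "{utterance}"\n'
        "Answer with a short image-search query or image description."
    )
    return generate_text(prompt).strip()


if __name__ == "__main__":
    # Stubbed model for demonstration; in practice this would be an LLM call.
    fake_llm = lambda p: "photo of a sourdough loaf fresh out of the oven"
    print(suggest_visual("I baked my first sourdough this weekend", fake_llm))
```

The returned query or description could then drive an image search or a text-to-image model, and the result is shown to participants as a visual overlay.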
Visual captions: Using large language models to augment video conferences with dynamic visuals Tuesday, June 06, 2023 Posted by Ruofei Du, Research Scientist, and Alex Olwal, Senior Staff Research Scientist, Google Augmented Reality Recent advances in video conferencing have significantly improved remote video communication through features like live captioning and noise cancellation. However...
Posted by Arsha Nagrani and Paul Hongsuck Seo, Research Scientists, Google Research

Automatic speech recognition [https://en.wikipedia.org/wiki/Speech_recognition] (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. While the challenges for this technology are centered around noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems — this is called audiovisual ASR (AV-ASR).
Although lip motion can provide strong signals for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in videos in the wild (e.g., due to egocentric viewpoints [https://en.wikipedia.org/wiki/Egocentric_vision], face coverings, and low resolution) and therefore, a new emerging area of research is unconstrained AV-ASR (e.g., AVATAR [https://arxiv.org/abs/2206.07684]), which investigates the contribution of entire visual frames, and not just the mouth region.
Building audiovisual datasets for training AV-ASR models, however, is challenging. Datasets such as How2 [https://srvk.github.io/how2-dataset/] and VisSpeech [https://gabeur.github.io/avatar-visspeech] have been created from instructional videos online, but they are small in size. In contrast, the models themselves are typically large and consist of both visual and audio encoders, and so they tend to overfit on these small datasets. Meanwhile, there have been a number of recently released large-scale audio-only models that are heavily optimized via large-scale training on massive audio-only data obtained from audio books, such as LibriLight [https://github.com/facebookresearch/libri-light] and LibriSpeech [https://www.openslr.org/12]. These models contain billions of parameters, are readily available, and show strong generalization across domains.
With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [https://arxiv.org/pdf/2303.16501.pdf]”, we present a simple method for augmenting existing large-scale audio-only models with visual information, at the same time performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information [https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model] into large language models for vision-text tasks) using lightweight trainable adaptors that can be trained on a small amount of weakly labeled video data with minimum additional training time and parameters. We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D [https://ego4d-data.org/]), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech [https://www.openslr.org/12]).
Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ [https://arxiv.org/pdf/2202.01855.pdf], in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). The visual context can provide helpful clues for robust speech recognition, especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only mistake “clove” to “loaf” in the generated transcript).
INJECTING VISION USING LIGHTWEIGHT MODULES
Our goal is to add visual understanding capabilities to an existing audio-only ASR model while maintaining its generalization performance to various domains (both AV and audio-only domains).
To achieve this, we augment an existing state-of-the-art ASR model (Best-RQ [https://arxiv.org/pdf/2202.01855.pdf]) with the following two components: (i) a linear visual projector and (ii) lightweight adapters. The former projects visual features into the audio token embedding space. This process allows the model to properly connect the separately pre-trained visual features and audio input token representations. The latter then minimally modifies the model to add understanding of multimodal inputs from videos. We then train these additional modules on unlabeled web videos from the HowTo100M dataset [https://www.di.ens.fr/willow/research/howto100m/], using the outputs of an ASR model as pseudo ground truth, while keeping the rest of the Best-RQ model frozen. These lightweight modules enable data efficiency and strong generalization of performance.
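As a rough sketch of how these two added modules could look (dimensions, module sizes, and placement are illustrative assumptions, not the exact AVFormer configuration):

```python
import torch
import torch.nn as nn

# Minimal sketch: a linear projector maps CLIP-style visual features into the
# audio token embedding space, and a residual bottleneck adapter is inserted
# into each frozen encoder layer; only these small modules receive gradients.

class VisualProjector(nn.Module):
    def __init__(self, visual_dim=768, audio_dim=1024, num_visual_tokens=4):
        super().__init__()
        self.proj = nn.Linear(visual_dim, audio_dim)
        self.num_visual_tokens = num_visual_tokens

    def forward(self, visual_feats):                # (batch, num_tokens, visual_dim)
        return self.proj(visual_feats)              # (batch, num_tokens, audio_dim)


class BottleneckAdapter(nn.Module):
    def __init__(self, dim=1024, bottleneck=256):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):                           # residual bottleneck adapter
        return x + self.up(self.act(self.down(x)))


def prepend_visual_tokens(audio_tokens, visual_feats, projector):
    """One simple way to fuse the modalities: place the projected visual tokens
    in front of the audio token sequence fed to the frozen encoder."""
    return torch.cat([projector(visual_feats), audio_tokens], dim=1)
```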
We evaluated our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.
CURRICULUM LEARNING FOR VISION INJECTION
After the initial evaluation, we discovered empirically that with a naïve single round of joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy [https://pubmed.ncbi.nlm.nih.gov/8403835/] that decouples these two factors — domain adaptation and visual feature integration — and trains the network in a sequential manner. In the first phase, the adapter parameters are optimized without feeding visual tokens at all. Once the adapters are trained, we add the visual tokens and train the visual projection layers alone in the second phase while the trained adapters are kept frozen.
The first stage focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector must simply learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase just once, as an iterative application of alternating phases leads to performance degradation.
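A schematic of this two-phase schedule might look like the following; the optimizer, learning rate, step counts, and loss hook are placeholders rather than the paper's exact training recipe.

```python
import torch

# Sketch of the two-phase curriculum: phase 1 trains only the adapters on audio
# (no visual tokens), phase 2 freezes them and trains only the visual projector.

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_phase(params, batches, loss_fn, steps):
    opt = torch.optim.Adam(params, lr=1e-4)          # placeholder optimizer/LR
    for _, batch in zip(range(steps), batches):
        opt.zero_grad()
        loss_fn(batch).backward()                    # e.g., loss on pseudo labels
        opt.step()

def curriculum(adapters, projector, audio_batches, av_batches, loss_fn, steps=10_000):
    # Phase 1: audio-domain adaptation, no visual tokens fed to the model.
    set_trainable(adapters, True); set_trainable(projector, False)
    train_phase(list(adapters.parameters()), audio_batches, loss_fn, steps)
    # Phase 2: adapters frozen, only the visual projection layer learns.
    set_trainable(adapters, False); set_trainable(projector, True)
    train_phase(list(projector.parameters()), av_batches, loss_fn, steps)
```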
Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer [https://arxiv.org/pdf/2005.08100.pdf] encoder-decoder model and a frozen CLIP [https://openai.com/research/clip] encoder (frozen layers shown in gray with a lock symbol), in conjunction with two lightweight trainable modules: (i) a visual projection layer (orange) and (ii) bottleneck adapters (blue) to enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all the other parts are kept frozen.

The plots below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.
Effects of curriculum learning. Red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER [https://en.wikipedia.org/wiki/Word_error_rate] % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up until 4 visual tokens, at which point it saturates.
RESULTS IN ZERO-SHOT AV-ASR
We compare AVFormer to BEST-RQ, the audio-only version of our model, and to AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech and Ego4D. AVFormer outperforms AVATAR and BEST-RQ on all three, even outperforming both when they are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ this involves training 600M parameters, while AVFormer trains only 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). Moreover, we also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.

Comparison to state-of-the-art methods for zero-shot performance across different AV-ASR datasets. We also show performance on LibriSpeech, which is audio-only. Results are reported as WER % (lower is better). AVATAR and BEST-RQ are finetuned end-to-end (all parameters) on HowTo100M, whereas AVFormer works effectively even with 5% of the dataset thanks to the small set of finetuned parameters.
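For reference, word error rate (WER), the metric reported above, is the word-level edit distance between the hypothesis and reference transcripts divided by the number of reference words. A minimal implementation:

```python
# Word error rate: substitutions + insertions + deletions, normalized by the
# number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("get a loaf of bread", "get a clove of bread"))  # 0.2 (one substitution)
```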
CONCLUSION
We introduce AVFormer, a lightweight method for adapting existing frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly allows both domain transfer and visual input mixing in the same parameter-efficient model.
ACKNOWLEDGEMENTS
This research was conducted by Paul Hongsuck Seo, Arsha Nagrani and Cordelia Schmid.
AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR Friday, June 02, 2023 Posted by Arsha Nagrani and Paul Hongsuck Seo, Research Scientists, Google Research Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. Whi...
Posted by Ken Caluwaerts and Atil Iscen, Research Scientists, Google

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments [https://ai.googleblog.com/2023/05/indoorsim-to-outdoorreal-learning-to.html] that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years [https://arxiv.org/abs/1804.10332] and across various form factors [https://ai.googleblog.com/2022/10/table-tennis-research-platform-for.html]. Yet, while researchers have enabled robots to hike [https://www.science.org/doi/10.1126/scirobotics.abc5986] or jump over some obstacles [https://www.roboticsproceedings.org/rss11/p47.pdf], there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet [https://arxiv.org/abs/1409.0575] for computer vision, and OpenAI Gym [https://github.com/openai/gym] for reinforcement learning (RL).
In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots [https://arxiv.org/abs/2305.14654]”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer [https://arxiv.org/abs/1706.03762]-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive way to understand robot performance relative to that of the robots' animal counterparts.
We invited a handful of dooglers [https://blog.google/inside-google/life-at-google/working-home-ruff-dooglers-make-it-little-better/] to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.
BARKOUR BENCHMARK
The Barkour scoring system uses per-obstacle and overall course target times based on the target speed of small dogs in novice agility competitions [https://images.akc.org/pdf/rulebooks/REAGIL.pdf] (about 1.7 m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping or failing obstacles, or for moving too slowly.
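As a rough illustration of how such a score could be computed: the post specifies the 0-to-1 range, the roughly 10-second course target, and the penalty types, but not the exact weights, so the numbers below are placeholder assumptions rather than the official Barkour formula.

```python
# Illustrative only: placeholder penalty weights, not the official Barkour scoring.

def barkour_score(obstacles_cleared, obstacles_total, run_time_s,
                  course_target_s=10.0, skip_penalty=0.1, overtime_penalty_per_s=0.01):
    score = obstacles_cleared / obstacles_total                    # base credit
    score -= skip_penalty * (obstacles_total - obstacles_cleared)  # skipped/failed
    score -= overtime_penalty_per_s * max(0.0, run_time_s - course_target_s)
    return max(0.0, min(1.0, score))

# A robot clearing all 4 obstacles in ~20 s would lose only the overtime penalty.
print(barkour_score(4, 4, 20.0))   # 0.9 under these placeholder weights
```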
Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.
Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance, and can be easily modified to incorporate other types of obstacles or course configurations.
LEARNING AGILE LOCOMOTION SKILLS
The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. Therefore, to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances [https://github.com/google/brax] in large-scale parallel simulation [https://arxiv.org/pdf/2108.10470.pdf] to equip the robot with individual skills, including walking, slope climbing, and jumping policies.
Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot's gait based on the perceived environment and the robot’s state.
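A simplified sketch of that distillation step follows; the environment interface, student network, and data handling are generic placeholders rather than the actual training setup.

```python
import torch
import torch.nn as nn

# Sketch: roll out each specialist policy in simulation to collect (state, action)
# pairs, then train a single student policy to imitate them (behavioral cloning).

def collect_rollouts(specialists, env_fn, episodes_per_skill=100):
    """specialists: dict of skill name -> policy; env_fn builds the matching env."""
    dataset = []
    for skill, policy in specialists.items():
        env = env_fn(skill)
        for _ in range(episodes_per_skill):
            state, done = env.reset(), False
            while not done:
                action = policy(state)
                dataset.append((state, action))
                state, _, done, _ = env.step(action)
    return dataset

def distill(student: nn.Module, dataset, epochs=10, batch_size=256):
    opt = torch.optim.Adam(student.parameters(), lr=3e-4)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for states, actions in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(student(states), actions)  # imitate teachers
            loss.backward()
            opt.step()
    return student
```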
During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot's position. Our trained policy controls the robot based on the robot's surroundings, represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.
Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.
Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy [https://arxiv.org/pdf/2110.05457.pdf] that quickly gets the robot back on its feet, allowing it to continue the episode.
EVALUATION
We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.
Model of the custom-built robots used for evaluation.
We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.
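The switching mechanism itself can be quite simple; a hypothetical version (segment boundaries and skill names below are invented for illustration) might just map the robot's progress along the course to a specialist policy:

```python
# Hypothetical illustration of a hand-tuned switching rule: pick the specialist
# whose course segment contains the robot's current progress along the course.

SEGMENTS = [            # (segment end along the course in meters, specialist skill)
    (2.0, "weave_poles"),
    (4.5, "climb_a_frame"),
    (6.0, "broad_jump"),
    (8.0, "walk_to_table"),
]

def select_policy(progress_m, policies):
    for end, skill in SEGMENTS:
        if progress_m <= end:
            return policies[skill]
    return policies["walk_to_table"]   # past the last segment, head to the end table
```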
Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.
We find that very often our policies can handle unexpected events and even hardware degradation, resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.
Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.
We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.
Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.
Histogram of the agility scores for the locomotion transformer policy. The highest scores, shown in blue (0.75–0.9), represent the runs where the robot successfully completes all obstacles.
CONCLUSION
We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.
ACKNOWLEDGMENTS
The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project. Thanks to John Guilyard for creating the animations in this post.
Barkour: Benchmarking animal-level agility with quadruped robots Friday, May 26, 2023 Posted by Ken Caluwaerts and Atil Iscen, Research Scientists, Google Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and effici...
Posted by Vincent Cohen-Addad and Alessandro Epasto, Research Scientists, Google Research, Graph Mining team

Clustering [https://en.wikipedia.org/wiki/Cluster_analysis] is a central problem in unsupervised [https://en.wikipedia.org/wiki/Unsupervised_learning] machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its core, clustering consists of the following problem: given a set of data elements, the goal is to partition the data elements into groups such that similar objects are in the same group, while dissimilar objects are in different groups. This problem has been studied in math, computer science, operations research and statistics for more than 60 years in its myriad variants. Two common forms of clustering [https://en.wikipedia.org/wiki/Cluster_analysis] are metric clustering, in which the elements are points in a metric space [https://en.wikipedia.org/wiki/Metric_space], like in the k-means [https://ieeexplore.ieee.org/document/1056489] problem, and graph clustering, where the elements are nodes of a graph [https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)] whose edges represent similarity among them.
In the k-means [https://ieeexplore.ieee.org/document/1056489] clustering problem, we are given a set of points in a metric space with the objective to identify k representative points, called centers (here depicted as triangles), so as to minimize the sum of the squared distances from each point to its closest center. Source [https://commons.wikimedia.org/wiki/File:Kmeans_toomany.PNG], rights: CC-BY-SA-4.0

Despite the extensive literature on algorithm design for clustering, few practical works have focused on rigorously protecting the user's privacy during clustering. When clustering is applied to personal data (e.g., the queries a user has made), it is necessary to consider the privacy implications of using a clustering solution in a real system and how much information the output solution reveals about the input data.
To ensure privacy in a rigorous sense, one solution is to develop differentially private [https://en.wikipedia.org/wiki/Differential_privacy] (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (e.g., whether a user has made a given query) or sensitive data about the input graph (e.g., a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on theory [https://arxiv.org/abs/2008.08007] and practice [https://ai.googleblog.com/2021/10/practical-differentially-private.html?hl=el&m=1] of differentially private metric [https://proceedings.neurips.cc/paper_files/paper/2022/hash/43f55776896a2e33239c2954519f605e-Abstract-Conference.html] or graph [https://proceedings.neurips.cc/paper_files/paper/2022/hash/da645920dcd3bd35b0dae329894bad80-Abstract-Conference.html] clustering, and differential privacy in a variety of contexts, e.g., heatmaps [https://ai.googleblog.com/2023/04/differentially-private-heatmaps.html] or tools [https://ai.googleblog.com/2022/12/differential-privacy-accounting-by.html] to design DP algorithms.
Today we are excited to announce two important updates: 1) a new differentially-private algorithm [https://arxiv.org/abs/2302.00037] for hierarchical graph clustering, which we’ll be presenting at ICML 2023 [https://icml.cc/Conferences/2023], and 2) the open-source release [https://github.com/google-research/google-research/tree/master/hst_clustering] of the code of a scalable differentially-private k-means algorithm. This code brings differentially private k-means clustering to large scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent launch in the health domain for informing public health authorities.
DIFFERENTIALLY PRIVATE HIERARCHICAL CLUSTERING
Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well known example of hierarchical clustering is the phylogenetic tree [https://en.wikipedia.org/wiki/Phylogenetic_tree] in biology in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm receives as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. Yet at the time of our research no algorithm was known to compute hierarchical clustering of a graph with edge privacy, i.e., preserving the privacy of the vertex interactions.
In “Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [https://doi.org/10.48550/arXiv.2302.00037]”, we consider how well the problem can be approximated in a DP context and establish firm upper and lower bounds on the privacy guarantee. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n^2.5) and a multiplicative approximation of O(log^1/2 n), with the multiplicative error identical to the non-private setting. We further provide a new lower bound on the additive error (of order n^2) for any private algorithm (irrespective of its running time) and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model [https://proceedings.neurips.cc/paper/2017/hash/e8bf0f27d70d480d3ab793bb7619aaa5-Abstract.html], a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm that returns a solution with an additive cost over the optimum that is negligible for larger and larger graphs, again matching the non-private state-of-the-art approaches. We believe this work expands the understanding of privacy-preserving algorithms on graph data and will enable new applications in such settings.
LARGE-SCALE DIFFERENTIALLY PRIVATE CLUSTERING
We now switch gears and discuss our work on metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms on the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ [https://en.wikipedia.org/wiki/K-means%2B%2B] or k-means|| [https://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf] can be made differentially private without drastically sacrificing either the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published multiple [https://dl.acm.org/doi/10.1145/3534678.3539409] papers [https://proceedings.neurips.cc/paper_files/paper/2022/hash/43f55776896a2e33239c2954519f605e-Abstract-Conference.html] that address the problem of designing efficient differentially private algorithms for clustering that can scale to massive datasets. Our goal is, moreover, to offer scalability to large scale input datasets, even when the target number of centers, k, is large.
We work in the massively parallel computation [https://www.cs.umd.edu/~gasarch/MPC/mpc.pdf] (MPC) model, which is a computation model representative of modern distributed computation architectures. The model consists of several machines, each holding only part of the input data, that work together with the goal of solving a global problem while minimizing the amount of communication between machines. We present a differentially private constant factor approximation algorithm [https://proceedings.neurips.cc/paper_files/paper/2022/hash/43f55776896a2e33239c2954519f605e-Abstract-Conference.html] for k-means that only requires a constant number of rounds of synchronization. Our algorithm builds upon our previous work [https://dl.acm.org/doi/10.1145/3534678.3539409] on the problem (with code available here [https://github.com/google-research/google-research/tree/master/hst_clustering]), which was the first differentially-private clustering algorithm with provable approximation guarantees that can work in the MPC model.
The DP constant factor approximation algorithm drastically improves on the previous work using a two phase approach. In an initial phase it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-step approximation, the second phase relies on results from the Coreset literature [https://dl.acm.org/doi/10.1145/3406325.3451022] to subsample a relevant set of input points and find a good differentially private clustering solution for the input points. We then prove that this solution generalizes with approximately the same guarantee to the entire input.
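For intuition, a common building block in DP k-means is to add calibrated noise to per-cluster statistics before recomputing centers. The sketch below shows a generic Gaussian-mechanism Lloyd update; it is not the paper's MPC coreset algorithm, and in practice the noise scale must be calibrated to the points' clipped norm and the desired (epsilon, delta) guarantee.

```python
import numpy as np

# Generic DP Lloyd-style update: noise is added to the per-cluster sums and
# counts, so each center is computed only from noisy aggregates.

def dp_kmeans_step(points, centers, noise_std):
    """points: (n, d) array with rows clipped to a bounded norm; centers: (k, d)."""
    k, d = centers.shape
    assign = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    new_centers = np.empty_like(centers)
    for j in range(k):
        members = points[assign == j]
        noisy_sum = members.sum(axis=0) + np.random.normal(0.0, noise_std, size=d)
        noisy_count = max(1.0, len(members) + np.random.normal(0.0, noise_std))
        new_centers[j] = noisy_sum / noisy_count
    return new_centers
```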
VACCINATION SEARCH INSIGHTS VIA DP CLUSTERING
We then apply these advances in differentially private clustering to real-world applications. One example is the application of our differentially private clustering solution to publishing COVID vaccine–related search queries while providing strong privacy protections for users.
The goal of Vaccination Search Insights [https://google-research.github.io/vaccination-search-insights/?] (VSI) is to help public health decision makers (health authorities, government agencies and nonprofits) identify and respond to communities' information needs regarding COVID vaccines. In order to achieve this, the tool allows users to explore at different geolocation granularities (zip-code, county and state level in the U.S.) the top themes searched by users regarding COVID queries. In particular, the tool visualizes statistics on trending queries rising in interest in a given locale and time.
Screenshot of the output of the tool. Displayed on the left are the top searches related to COVID vaccines during the period October 10–16, 2022. On the right are the queries that had rising importance during the same period compared to the previous week.

To help identify the themes of the trending searches, the tool clusters the search queries based on their semantic similarity. This is done by applying a custom-designed k-means–based algorithm run over search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially private clustering). The method ensures strong differential privacy guarantees for the protection of the user data.
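A tiny sketch of the anonymization idea described above (the noise scale and threshold are illustrative placeholders, not the values used in the tool):

```python
import numpy as np

# Add Gaussian noise to each query's count and drop queries whose noisy count
# falls below a threshold, before clustering the remaining queries.

def anonymize_query_counts(query_counts, noise_std=10.0, min_count=50):
    kept = {}
    for query, count in query_counts.items():
        noisy = count + np.random.normal(0.0, noise_std)
        if noisy >= min_count:
            kept[query] = noisy
    return kept
```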
This tool provided fine-grained data on COVID vaccine perception in the population at unprecedented scales of granularity, something that is especially relevant for understanding the needs of marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in research on differential privacy and unsupervised ML methods. We are looking at other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, like search queries on climate change–related challenges [https://blog.google/technology/health/dr-von-nguyens-temperature-check-on-public-health/] such as air quality or extreme heat.
ACKNOWLEDGEMENTS
We thank our co-authors Jacob Imola, Silvio Lattanzi, Jason Lee, Mohammad Mahdian, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, Mark Phillips, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, Peilin Zhong, and the members of the Health AI team that made the VSI launch possible: Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.
For more information on the Graph Mining team [https://ai.google/research/teams/algorithms-optimization/graph-mining/] (part of Algorithms and Optimization [https://ai.google/research/teams/algorithms-optimization/]), visit our pages.
Differentially private clustering for large-scale datasets Thursday, May 25, 2023 Posted by Vincent Cohen-Addad and Alessandro Epasto, Research Scientists, Google Research, Graph Mining team Clustering is a central problem in unsupervised machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its....