The workshop will feature four keynotes by the following speakers.
Far-field speech recognition has become a popular research area in the past few years, from research-focused activities such as the CHiME Challenges to the launches of Amazon Echo and Google Home. This talk will describe the research efforts around Google Home. Most multichannel ASR systems separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this talk, we will introduce a framework to do multichannel enhancement jointly with acoustic modeling using deep neural networks.
Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain.
Tara Sainath received her B.S. (2004), M.Eng. (2005) and Ph.D. (2009) in Electrical Engineering and Computer Science, all from MIT. The main focus of her PhD work was acoustic modeling for noise-robust speech recognition. After her PhD, she spent five years in the Speech and Language Algorithms group at the IBM T.J. Watson Research Center before joining Google Research. She co-organized a special session on Sparse Representations at Interspeech 2010 in Japan and organized a special session on Deep Learning at ICML 2013 in Atlanta. In addition, she is a staff reporter for the IEEE Speech and Language Processing Technical Committee (SLTC) Newsletter. Her research interests are mainly in acoustic modeling and deep neural networks.
Frank K. Soong, Principal Researcher/Research Manager
Speech Group, Microsoft Research Asia (MSRA)
A person’s speech is strongly conditioned by his own articulators and the language(s) he speaks; hence, rendering speech in an inter-speaker or inter-language manner from a source speaker’s speech data collected in his native language is both academically challenging and desirable for technology and applications. The quality of the rendered speech is assessed in three dimensions: naturalness, intelligibility and similarity to the source speaker. Usually, the three criteria cannot all be met when rendering is done in both cross-speaker and cross-language ways. We will objectively analyze the key factors of rendering quality in both the acoustic and phonetic domains. Monolingual speech databases recorded by different speakers, or bilingual ones recorded by the same speaker(s), are used. Measures in the acoustic space and phonetic space are adopted to quantify naturalness, intelligibility and speaker timbre objectively. Our “trajectory tiling” algorithm-based cross-lingual TTS is used as the baseline system for comparison. To equalize speaker differences automatically, a DNN-based ASR acoustic model trained in a speaker-independent manner is used. Kullback-Leibler divergence is proposed to statistically measure the phonetic similarity between any two given speech segments, which may come from different speakers or languages, in order to select good rendering candidates. Demos of voice conversion, speaker-adaptive TTS and cross-lingual TTS will be shown in inter-speaker settings, inter-language settings, or both. The implications of this research for low-resourced speech research, speaker adaptation, “average speaker’s voice”, accented/dialectal speech processing, speech-to-speech translation, audio-visual TTS, etc. will be discussed.
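As a rough illustration of the Kullback-Leibler similarity measure mentioned in the abstract, the sketch below compares two speech segments represented as per-frame posterior distributions (for example, phonetic-class posteriors from a speaker-independent DNN). The abstract does not specify the exact formulation, so several details are assumptions made only for this example: the symmetric form of the divergence, plain averaging over frames, segments that are already time-aligned and of equal length, and all function names.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetrized KL divergence (an assumption; the abstract names only KLD)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def segment_distance(posteriors_a, posteriors_b):
    """Average symmetric KL over the frames of two aligned, equal-length segments."""
    if not posteriors_a or len(posteriors_a) != len(posteriors_b):
        raise ValueError("segments must be non-empty and have the same number of frames")
    total = sum(symmetric_kl(p, q) for p, q in zip(posteriors_a, posteriors_b))
    return total / len(posteriors_a)


def select_candidate(target, candidates):
    """Return the index of the candidate segment closest to the target."""
    distances = [segment_distance(target, c) for c in candidates]
    return distances.index(min(distances))


# Hypothetical posteriors over three phonetic classes, two frames per segment.
target = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
candidates = [
    [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]],
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
]
best = select_candidate(target, candidates)
```

Here `best` is the index of the candidate whose posteriors are phonetically closest to the target under this assumed measure; a real system would also need to handle segments of different lengths.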
Frank K. Soong is a Principal Researcher and Research Manager, Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoding algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was built into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as “outstandingly the best”. He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) system. He has served as a member of the Speech and Language Technical Committee, IEEE Signal Processing Society, and in other society functions, including as Associate Editor of the IEEE Speech and Audio Transactions and as chair of an IEEE workshop. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer, 1996. He is a visiting professor at the Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China. He is also the co-Director of the National MSRA-CUHK Joint Research Lab. He received his BS, MS and PhD from National Taiwan Univ., Univ. of Rhode Island, and Stanford Univ., respectively, all in Electrical Eng. He is an IEEE Fellow “for contributions to digital processing of speech”.
DeepMind - Carnegie Mellon University
End-to-end learning of neural networks that directly model p(output | input) has simultaneously improved the performance and reduced the complexity of the modular systems that were widely used in many areas of applied machine learning until just a few years ago. This pattern of success has been repeated in domains as varied as speech recognition and synthesis, computer vision, and natural language processing (NLP). In this talk, I ask what generative models (that is, neural network models of the joint distribution p(input, output) obtained via a “noisy channel” factorization p(output) × p(input | output)) still have to tell us about how to solve difficult prediction problems. I review a series of experiments in the NLP domain that compare the performance of generative and discriminative models on classically “discriminative” tasks including text classification, syntactic parsing, machine translation, and text summarization. The results demonstrate that generative models have several advantages over their discriminative counterparts, including better sample complexity, more straightforward use of unpaired training data (e.g., incorporating an independent language model), robustness to “label bias” in sequence transduction problems, good outlier detection, and the ability to incorporate knowledge of natural processes into model construction. These results come at the cost of increased inferential complexity, and I conclude with some discussion about how this might be remedied.
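To make the “noisy channel” factorization p(output) × p(input | output) concrete, here is a minimal sketch of a generative text classifier. It uses a naive Bayes bag-of-words model with add-one smoothing as a simple stand-in; this is an assumption for illustration only and is not one of the neural models discussed in the talk. The training sentences and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit p(label) and p(word | label) from a list of (tokens, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total = sum(label_counts.values())
    # log p(output): the prior over labels.
    log_prior = {y: math.log(c / total) for y, c in label_counts.items()}

    def log_likelihood(tokens, y):
        """log p(input | output) with add-one smoothing; unseen words are skipped."""
        denom = sum(word_counts[y].values()) + len(vocab)
        return sum(
            math.log((word_counts[y][w] + 1) / denom)
            for w in tokens
            if w in vocab
        )

    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Pick the label maximizing log p(output) + log p(input | output)."""
    return max(log_prior, key=lambda y: log_prior[y] + log_likelihood(tokens, y))


examples = [
    ("good great fun".split(), "pos"),
    ("great acting".split(), "pos"),
    ("bad boring".split(), "neg"),
    ("boring plot bad".split(), "neg"),
]
log_prior, log_likelihood = train(examples)
prediction = classify("great fun".split(), log_prior, log_likelihood)
```

The prior `log_prior` plays the role of p(output) and could, in principle, be estimated from unpaired data, which is one of the advantages the abstract attributes to generative models.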
Chris Dyer is a research scientist at Google DeepMind and an assistant professor in the School of Computer Science at Carnegie Mellon University. In 2017, he received the Presidential Early Career Award for Scientists and Engineers (PECASE). His work has occasionally been nominated for best paper awards in prestigious NLP venues and has, much more occasionally, won them. He lives in London and, in his spare time, plays cello.
Osaka University - ATR
We are developing various conversational robots at Osaka University and ATR. This talk introduces the robots and discusses fundamental issues. In particular, it focuses on two such issues: dialogue and the feeling of presence, called "sonzaikan" in Japanese.
Hiroshi Ishiguro received a D.Eng. in systems engineering from Osaka University, Japan in 1991. He is currently a Professor in the Department of Systems Innovation in the Graduate School of Engineering Science at Osaka University (2009-), visiting Director (2014-) (group leader: 2002-2013) of Hiroshi Ishiguro Laboratories at the Advanced Telecommunications Research Institute, and an ATR fellow. His research interests include distributed sensor systems, interactive robotics, and android science. He has published more than 300 papers in major journals and conferences, such as Robotics Research and IEEE PAMI. He has also developed many humanoids and androids, called Robovie, Repliee, Geminoid, Telenoid, and Elfoid. These robots have been covered many times by major media, such as Discovery Channel, NHK, and BBC. He has also received the best humanoid award four times in RoboCup. In 2011, he won the Osaka Cultural Award presented by the Osaka Prefectural Government and the Osaka City Government for his great contribution to the advancement of culture in Osaka. In 2015, he received the Prize for Science and Technology (Research Category) from the Minister of Education, Culture, Sports, Science and Technology (MEXT). He was also awarded the Sheikh Mohammed Bin Rashid Al Maktoum Knowledge Award in Dubai in 2015.
The workshop will feature six talks by the following invited speakers.
IBM Watson Group
We live in an era where more and more tasks, once thought to be impregnable bastions of human intelligence, succumb to AI. Are we at the cusp where ASR systems have matched expert humans in conversational speech recognition? We try to answer this question with some experimental evidence on the Switchboard English conversational telephony corpus. On the human side, we describe some listening experiments which established a new human performance benchmark. On the ASR side, we discuss a series of deep learning architectures and techniques for acoustic and language modeling that were instrumental in lowering the word error rate to record levels on this task.
Dr. Saon is a Principal Research Staff Member in the Watson Multimodal Group at the IBM T. J. Watson Research Center. He received his M.Sc. and Ph.D. degrees in computer science from Henri Poincaré University in Nancy, France, in 1994 and 1997, respectively. In 1995, Dr. Saon obtained his engineer diploma from the Polytechnic University of Bucharest, Romania. From 1994 to 1998, he worked on two-dimensional stochastic models for off-line handwriting recognition at the Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA). Since joining IBM in 1998, Dr. Saon has worked on a variety of problems spanning several areas of large vocabulary continuous speech recognition, such as discriminative feature processing, acoustic modeling, speaker adaptation and large vocabulary decoding algorithms. Since 2001, Dr. Saon has been a key member of IBM’s speech recognition team that participated in a variety of U.S. government-sponsored evaluations. He has published more than 100 conference and journal papers and holds several patents in the field of speech recognition. He is the recipient of two best paper awards (INTERSPEECH 2010 and ASRU 2011) and served as an elected member of the IEEE Speech and Language Technical Committee from 2012 to 2016.
Imperial College London / University of Augsburg
Human performance often appears as a glass ceiling when it comes to automatic speech and speaker analysis. In some tasks, such as health monitoring, however, automatic analysis has successfully started to break this ceiling. The field has by now benefited from more than a decade of deep neural learning approaches such as recurrent LSTM nets and deep RBMs; recently, however, a further major boost has been witnessed. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning has made it possible to broaden the range of tasks handled in parallel and to account for the uncertainty often found in the gold standard due to subjective labels such as emotion or perceived personality of speakers. This talk highlights these and further recent trends, such as increasingly deeper nets and the usage of deep image nets for speech analysis, on the road to 'holistic' superhuman speech analysis that 'sees the whole picture' of the person behind a voice. At the same time, increasing efficiency is shown for a world of ever 'bigger' data and increasingly mobile applications that requires fast and resource-aware processing. The exploitation in ASR and SLU is featured throughout.
Björn W. Schuller heads the Group on Language, Audio & Music (GLAM) at Imperial College London/UK, is a CEO of audEERING, and is a Full Professor in CS at the University of Augsburg/Germany. He further holds a Visiting Professorship at the Harbin Institute of Technology/China. He received his diploma, doctoral, and habilitation degrees in EE/IT from TUM in Munich/Germany. His previous positions include Visiting Professor, Associate, and Scientist at VGTU/Lithuania, University of Geneva/Switzerland, Joanneum Research/Austria, Marche Polytechnic University/Italy, and CNRS-LIMSI/France. His 650+ technical publications (15000+ citations, h-index 59) focus on machine intelligence for audio and signal analysis. He is the Editor-in-Chief of the IEEE Transactions on Affective Computing, a General Chair of ACII 2019, and a Technical Chair of Interspeech 2019, among various further roles.
Senior Staff Research Scientist at Google
Machine learning, and in particular neural networks, has made great advances in the last few years for products that are used by millions of people, most notably in speech recognition, image recognition and most recently in neural machine translation. Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which addresses many of these issues. The model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To accelerate final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units for both input and output. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. In human side-by-side evaluations, it reduces translation errors by more than 60% compared to Google's phrase-based production system. The new Google Translate was launched in late 2016 and has improved translation quality significantly for all Google users.
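The idea of dividing rare words into common sub-word units can be illustrated with a small sketch. The example below uses greedy longest-match segmentation against a fixed vocabulary; this is an assumed, simplified scheme chosen for illustration and is not claimed to be the segmentation procedure used in GNMT. The vocabulary and the `<unk>` fallback token are hypothetical.

```python
def segment(word, vocab, unk="<unk>"):
    """Split a word into sub-word units by greedy longest-match lookup."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No unit in the vocabulary matches at this position.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical sub-word vocabulary.
vocab = {"trans", "lat", "ion", "s", "un"}
pieces = segment("translations", vocab)
```

With this vocabulary, a word never seen as a whole is still represented by a short sequence of known units, so the model needs only a limited inventory of units to cover an open vocabulary.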
Dr. Mike Schuster graduated in Electrical Engineering from the Gerhard-Mercator University in Duisburg, Germany in 1993. After receiving a scholarship, he spent a year in Japan to study Japanese in Kyoto and Fiber Optics in the Kikuchi laboratory at Tokyo University. His professional career in machine learning and speech brought him to Advanced Telecommunications Research Laboratories in Kyoto, Nuance in the US and NTT in Japan, where he worked on general machine learning and speech recognition research and development after getting his PhD at the Nara Institute of Science and Technology. Dr. Schuster joined the Google speech group at the beginning of 2006 and, over the next eight years, saw speech products develop from scratch to toy demos to serving millions of users in many languages; he was the main developer of the original Japanese and Korean speech recognition models. He is now part of the Google Brain group, which focuses on building large-scale neural network and machine learning infrastructure for Google, and has been working on infrastructure with the TensorFlow toolkit as well as on research, mostly in the field of speech and translation with various types of recurrent neural networks. In 2016 he led the development of the new Google Neural Machine Translation system, which reduced translation errors by more than 60% compared to the previous system.
IBM Fellow, IBM Research
IBM Distinguished Service Professor, Carnegie Mellon University
Computers have been changing the lives of blind people. Voice synthesis technology has improved their educational environment and job opportunities by allowing them to access online services. Now, new AI technologies are reaching the point where computers can help in sensing, recognizing, and understanding the real world we live in. I will first introduce the concept of a cognitive assistant for the blind, which will help blind and visually impaired people explore their surroundings and enjoy city environments by compensating for their missing visual sense with the power of integrated AI technologies. I will then introduce the latest technologies, including an accurate indoor navigation system and a personal object recognition system, followed by a discussion of the role of the blind: how we can accelerate the advancement of AI technologies.
Chieko Asakawa has been instrumental in furthering accessibility research and development for three decades. A series of pioneering technologies led by Chieko has contributed significantly to advancing web accessibility and usability, including groundbreaking work on digital braille and a voice web browser. Today, Chieko is focusing on advancing cognitive assistant research to help the blind regain information by augmenting missing or weakened abilities in the real world.
Chieko is a member of the Association for Computing Machinery (ACM), the National Academy of Engineering (NAE), and the IBM Academy of Technology. In 2009, she was named an IBM Fellow, IBM’s most prestigious technical honor. She won the Medal of Honor with Purple Ribbon from the government of Japan in 2013. She has also been serving as an IBM Distinguished Service Professor at Carnegie Mellon University since 2014.
Assistant Professor of Electrical and Computer Engineering
Center for Language and Speech Processing
Johns Hopkins University
The speech signal is complex and contains a tremendous quantity of diverse information. The first step in extracting this information is to define an efficient representation that can model as much information as possible and will facilitate the extraction process. The I-vector representation is a statistical data-driven approach for feature extraction, which provides an elegant framework for speech classification and identification in general. This representation became the state of the art in several speech processing tasks and has recently been integrated with deep learning methods. This talk will focus on presenting a variety of applications of the I-vector representation for speech and audio tasks, including speaker profiling, speaker diarization and speaker health analysis. We will also show the possibility of using this representation to model and visualize information present in deep neural network hidden layers.
Najim Dehak received his PhD from the School of Advanced Technology, Montreal in 2009. During his PhD studies he worked with the Computer Research Institute of Montreal, Canada. He is well known as a leading developer of the I-vector representation for speaker recognition. He first introduced this method, which has become the state of the art in this field, during the 2008 summer Center for Language and Speech Processing workshop at Johns Hopkins University. This approach has become one of the best-known speech representations in the entire speech community.
Dr. Dehak is currently a faculty member of the Department of Electrical & Computer Engineering at Johns Hopkins University. Prior to joining Johns Hopkins, he was a research scientist in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory. His research interests are in machine learning approaches applied to speech processing, audio classification, and health applications. He is a senior member of the IEEE and a member of the IEEE Speech and Language Technical Committee.
University of Illinois at Urbana-Champaign
As commonplace speech-enabled devices are getting smaller and lighter, we are faced with a need for simpler processing and simpler hardware. In this talk I will present some alternative ways to approach multi-channel and single-channel speech enhancement under these constraints. More specifically, I will talk about new ways to formulate beamforming that are numerically more lightweight, and operate best when using physically compact arrays, and then I will discuss single-channel approaches using a deep network which, in addition to imposing a lightweight computational load, are amenable to aggressive hardware optimizations that can result in massive power savings and reductions in hardware footprint.
Paris Smaragdis is an associate professor in the Computer Science and the Electrical and Computer Engineering departments of the University of Illinois at Urbana-Champaign, as well as a senior research scientist at Adobe Research. He completed his master's, PhD, and postdoctoral studies at MIT, performing research on computational audition. In 2006 he was selected by MIT’s Technology Review as one of the year’s top young technology innovators for his work on machine listening, in 2015 he was elevated to IEEE Fellow for contributions in audio source separation and audio processing, and during 2016-2017 he is an IEEE Signal Processing Society Distinguished Lecturer. He has authored more than 100 papers on various aspects of audio signal processing, holds more than 40 patents worldwide, and his research has been productized by multiple companies.