Malcolm Slaney
Dr. Malcolm Slaney
is a Consulting Professor
at Stanford CCRMA,
where he has led the Hearing Seminar for more than 20 years,
and an
Affiliate Faculty member
in the Electrical Engineering Department at the University of Washington.
He is a coauthor, with A. C. Kak, of the IEEE book "Principles of Computerized Tomographic Imaging."
This book was republished by SIAM in their
"Classics in Applied Mathematics" Series.
He is coeditor, with Steven Greenberg, of the book
"Computational Models of Auditory Function."
Dr. Slaney has a long career in industry, including doing machine learning and auditory perception work at
Google Research.
Before Google, Dr. Slaney worked at Bell Laboratories, Schlumberger Palo Alto Research, Apple Computer,
Interval Research, IBM's Almaden Research Center, Yahoo! Research, and Microsoft Research.
For many years, he has led the auditory group at the
Telluride Neuromorphic Cognition Engineering Workshop.
Dr. Slaney's recent work is on understanding the role of attention in conversational speech and general audio perception.
Awards:
Publications and Pointers
A more complete list of my publications is at this link:
Publications by Year or on
Google Scholar.
I used to work in the Machine Hearing group, which is part of the
Machine Perception group, at Google Research in Mountain View, CA.
Before that I was in the Speech
and Dialog group at Microsoft Research in Mountain View, CA,
Yahoo!
Research and IBM's Almaden
Research Center. My IBM work is described on this page.
Before that I worked for Interval
Research, Apple Computer's Advanced Technology Group, and
Schlumberger Palo Alto Research.
Several of my technical reports and papers are available on the
net for downloading. The following is a brief list. I have a personal web page for
the fun stuff. Many of my papers can be found online via the IEEE
or the ACM
portals.
This page shows my
auditory perception,
machine learning,
multimedia analysis work,
auditory modeling work,
my signal processing work,
some of my software tools,
and pointers to other work.
Note! My tomography
book is now online. Get more information here. The book
is back in print and you can order it now from SIAM. SIAM
honored us by including it in their "Classics
in Applied Mathematics" series of books!!!
Auditory Perception
Machine Learning
Multimedia Analysis
I spent a few years investigating an algorithm
known as Locality Sensitive Hashing (LSH) that is used to
efficiently find nearest neighbors. I wanted to understand
how to make LSH more efficient. I wrote a tutorial
with Michael Casey and Christoph Rhodes. Then with
colleagues at Yahoo I wrote a "definitive" article about how
to choose the optimum parameters. Both the Matlab
(optimization) and Python (implementation) code are online
too.
|
A tutorial
about LSH.
How to optimize
LSH
GIT
Repository for Matlab and Python Code
|
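For readers new to the idea, here is a minimal sketch of one common LSH scheme for Euclidean distance (quantized random projections). It is an illustration only, not the code in the repository above, and it does not reproduce the parameter optimization from the article. The class name and the parameters k, num_tables, and w are hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index: several hash tables, each keyed by k quantized random projections."""

    def __init__(self, dim, k=4, num_tables=8, w=1.0, seed=0):
        rng = random.Random(seed)
        self.w = w  # bucket width along each random direction (assumed value)
        # Each table gets k (direction, offset) pairs.
        self.projections = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_index, x):
        # h(x) = floor((a . x + b) / w) for each of the k projections.
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) / self.w)
            for a, b in self.projections[table_index]
        )

    def add(self, x):
        """Store a point in every table and return its index."""
        index = len(self.points)
        self.points.append(list(x))
        for t, table in enumerate(self.tables):
            table[self._key(t, x)].append(index)
        return index

    def query(self, x):
        """Return the index of the closest point among bucket collisions, or None."""
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, x), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(self.points[i], x))


index = EuclideanLSH(dim=3, k=4, num_tables=8, w=4.0, seed=1)
index.add([0.0, 0.0, 0.0])
index.add([10.0, 10.0, 10.0])
nearest = index.query([0.0, 0.0, 0.0])
```

Only the points that collide with the query in at least one table are compared exactly, which is where the speedup comes from. Larger k makes buckets more selective, and more tables recover neighbors that a single table misses; choosing those values well is the subject of the optimization article.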
I wrote a column for IEEE Multimedia
Magazine about my vision of the multimedia world. The
columns are online. |
Vision and
Views |
I get to
work with lots of wonderful image data and some very smart
computer-vision people.
For a couple of years, I worked with Rainer
Lienhart and Eva Hoerster on image classification in large
databases.
|
With Eva and Rainer: CIVR2008,
DAGM2008,
CIVR2007,
ICASSP2007
With Srinivasan: Bipartite
Models
|
I've been working on finding similar
songs in large music databases with Michael Casey
at Dartmouth and Goldsmiths College, University of
London. We want to find matches that are similar, but
not exact (fingerprinting finds exact matches).
Michael wrote a great overview of music-information
retrieval, and I helped edit a special
issue
of IEEE Transactions on Audio, Speech and Language
Processing.
I've also been working with William
White from Yahoo's media group to better understand
how to deliver music. We've characterized the diversity
of people's musical interests, studied item-to-item
similarity (using 480,000 subjects), and, most
recently, surveyed several techniques for content-based
similarity.
My work with Benjamin Marlin, when he was an intern at
Yahoo! Research, turned into a nice paper about modeling
uncertainty in rating data.
|
The best overview of our music-similarity work is
in IEEE
TASLP. See earlier work at ICASSP
2007, ICASSP
2006 and ISMIR
2006.
2006 - Music
Diversity
2007 - Item
Similarity
2008 - Content
Similarity
|
Auditory Modeling
There is now a new version of
the Auditory Toolbox. It contains Matlab functions to
implement many different kinds of auditory models. The
toolbox includes code for Lyon's passive longwave model,
Patterson's gammatone filterbank, Meddis' hair cell model,
Seneff's auditory model, correlograms and several common
representations from the speech-recognition world
(including MFCC, LPC and spectrograms). This code has been
tested on Macintosh, Windows, and Unix machines using
Matlab 5.2.
Note: This toolbox was originally published as Apple
Computer Technical Report #45. The old technical report
(PDF and Postscript) and old code (Unix TAR and Macintosh
BinHex) are available for historical reasons.
|
Auditory
Toolbox
(Version
2.0) |
My primary scientific goal is to
understand how our brains perceive sound. My role in this
research area is that of a modeler: I build models that explain the
neurophysiological and psychoacoustic data. Hopefully these
models will help other researchers understand the mechanisms
involved and result in better experiments. My latest work in
this area is titled "Connecting Correlograms to
Neurophysiology and Psychoacoustics" and was presented at
the XIth
International
Symposium on Hearing in Grantham, England, from 1-6
August, 1997. Two correlograms, one computed using
autocorrelation and the other computed using AIM, are
shown on the left. |
Abstract |
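As a rough illustration of the autocorrelation form of a correlogram, the sketch below computes one autocorrelation row per channel and sums the rows to find a common periodicity. It is a simplified sketch, not the model from the paper: the two sine waves stand in for cochlear filterbank outputs, and the function names, frame length, and lag limits are assumptions chosen for the example.

```python
import math


def autocorrelation(signal, max_lag):
    """Autocorrelation for lags 0..max_lag, using the same number of terms at every lag."""
    length = len(signal) - max_lag
    return [sum(signal[t] * signal[t + lag] for t in range(length))
            for lag in range(max_lag + 1)]


def correlogram_frame(channels, max_lag):
    """One correlogram frame: one autocorrelation row per filterbank channel."""
    return [autocorrelation(channel, max_lag) for channel in channels]


def summary_peak_lag(frame, min_lag):
    """Sum the rows across channels and return the lag with the largest total."""
    summary = [sum(column) for column in zip(*frame)]
    return max(range(min_lag, len(summary)), key=lambda lag: summary[lag])


sample_rate = 8000
# Stand-ins for cochlear channel outputs: two harmonics of a 100 Hz tone.
channels = [
    [math.sin(2 * math.pi * f * t / sample_rate) for t in range(800)]
    for f in (100, 200)
]
frame = correlogram_frame(channels, max_lag=120)
peak = summary_peak_lag(frame, min_lag=20)
pitch_hz = sample_rate / peak
```

Each channel peaks at multiples of its own period, and the lag where the channels agree corresponds to the period of the common fundamental. The min_lag bound simply skips the trivial peak at lag zero.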
I have written
several papers describing how to convert auditory
representations into sounds. I have built models of the
cochlea and central auditory processing, which I hope both
explain auditory processing and will allow us to build
auditory sound separation tools. These papers describe the
process of converting sounds into cochleagrams and
correlograms, and then converting these representations
back into sounds. Unlike the printed versions of this
work, the web page includes audio file examples. It
includes better spectrogram inversion techniques, a
description of how to invert Lyon's passive cochlear
model, and a description of correlogram inversion. This
material was first presented as part of the Proceedings
of the ATR Workshop on "A Biological Framework for
Speech Perception and Production" published in
September 1994. A more refined version of this paper was
an invited talk at the 1994
NIPS
conference. The image on the left shows the
spectrogram of one channel of cochlear output; one step in
the correlogram inversion process. |
ATR
(Kyoto)
Workshop Web Reprint with Sound Examples
Keynote
NIPS
Conference Paper (Postscript)
|
Pattern Playback is
the term used by Frank Cooper to describe his successful
efforts to paint spectrograms on plastic and then convert
them into sound. I wrote about Pattern Playback techniques,
from Frank Cooper's efforts to my own efforts with
auditory model inversion, in a paper which was published
at the 1995 IEEE International Conference on Systems,
Man, and Cybernetics. My paper is titled "Pattern
Playback from 1950 to 1995". The image at the left shows a
portion of one of Cooper's spectrograms. |
Web
Version
Postscript
(1.8M)
Adobe
PDF
(227k)
|
The following
are publications during my time at Apple. The Mathematica notebooks
are
designed to be self-documenting and in each case the postscript and
PDF files are also available. Those files that are Matlab toolboxes
include
source and documentation. All these files are available with the
gracious permission of Apple.
"Auditory Model Inversion for Sound
Separation" is the first paper to describe correlogram
inversion techniques. We also discuss improved methods for
inverting spectrograms and a cochlear model designed by
Richard F. Lyon. This paper was published at ICASSP '94. |
Postscript
(1.5M)
Adobe
PDF
(243k)
Online
patent
|
"A Perceptual Pitch Detector" is a paper
that describes a model of human pitch perception. It is
similar to work done by Meddis and Hewitt and published in
JASA, but this paper has more real-world examples. This
paper was published at ICASSP '90. |
Postscript
(3M)
Adobe
PDF
(315k)
|
"On the importance of time" is an invited
chapter by Dick Lyon and myself in the book Visual
Representations
of Speech Signals (edited by Martin Cooke,
Steve Beet and Malcolm Crawford, John Wiley & Sons).
This tutorial describes the reason that we think
time-domain processing is important when modeling the
cochlea and higher-level processing. |
Postscript
Adobe
PDF
|
A software package called MacEar implements
the latest version of Lyon's Cochlear Model. MacEar is
written in very portable C for Unix and Macintosh
computers. This link points to the last published version
(2.2). (Note that the included README file has old program
results. The names of the output files have changed and
there are a couple of extra channels being output. I'm
sorry for the confusion.) |
Unix
Shell Archive with Sources |
Signal Processing
I recently finished some nice work
establishing a linear operator connecting the audio and
video of a speaker. A paper describing this work has been
accepted for presentation at the NIPS'2000 conference. |
PDF
Paper
(600k) |
Chris Bregler, Michele Covell,
and I developed a technique we call Video Rewrite to
automatically synthesize video of talking heads. This
technology is cool because we use a purely data driven
approach (concatenative triphone video synthesis) to
create new video of a person speaking. Given new audio, we
concatenate the best sequence of lip images and morph them
into a background sequence. We can automatically create
sequences like the Kennedy and Johnson scenes in the movie
"Forrest Gump." |
Original
SIGGRAPH
'97 Paper (with examples)
Audio
Visual
Speech Perception Workshop
|
We studied how adults convey
affective messages to infants using prosody. We did not
attempt to recognize the words, let alone to distill more
nebulous concepts such as satire or irony. We analyzed
speech with low-level acoustic features and discriminated
approval, attentional bids, and prohibitions from adults
speaking to their infants. We built automatic classifiers
to create a system, Baby Ears, that performs the task that
comes so naturally to infants. The image on the left shows
one of the decision surfaces which classifies approval,
attention and prohibition utterances on the basis of their
pitch. |
Web
Page
Postscript
(189k)
Adobe
PDF
(42k)
|
We wrote a more detailed article describing
this work for the journal Speech
Communications. We can't post that article,
but I can send you a copy if you send me email. |
Send email
for a copy of journal article. |
I was able to help Michele Covell do some
neat work on time-compression of audio. Lots of people
know how to compress a speech utterance by a constant
amount. But if you want to do better, which parts of the
speech signal can be compressed the most? This paper
describes a good technique and shows how to test the
resulting comprehension. |
Conference
Paper
Technical
Report
with Audio Samples
|
Eric Scheirer and I worked on a
system for discriminating between speech and music in an
audio signal. This paper describes a large number of
features, how they can be combined into a statistical
framework, and the resulting performance on discriminating
signals found on radio stations. The results are better
than anybody else's results. (That comparison is not
necessarily valid since there are no common testing
databases. We did work hard to make our test set
representative.) This paper was published at the 1997
ICASSP in Munich. The image on the left shows clouds of
our data. |
Web
Page
Postscript
(349k)
Adobe
PDF
(263k)
|
Work we've done to morph between
two sounds is described in a paper at the 1996 ICASSP.
This work is new because it extends previous audio
morphing work to include inharmonic sounds. This paper
uses results from Auditory Scene Analysis to represent,
match, warp, and then interpolate between two sounds. The
image on the left shows the smooth spectrogram, one of two
independent representations used when morphing audio
signals. |
Web
Page
Postscript
(3M)
Adobe
PDF
(237k)
Patent
|
I wrote an article describing my experiences writing
"intelligent" signal processing documents. My Mathematica
notebook "Lyon's Cochlear Model" was the first large
document written with Mathematica. While I don't use
Mathematica as much as I used to, I still believe that
intelligent documents are a good way to publish scientific
results. These ideas were also published in a book titled
"Knowledge Based Signal Processing" that was published by
Prentice Hall. |
KBSP
Book
Chapter in Adobe PDF (3M)
IEEE
Signal
Processing Article in Adobe PDF (2M)
|
Software Publications
I have written Matlab m-functions
that read and write QuickTime movies. The WriteQTMovie code
is more general than previous solutions for creating movies
in Matlab. It runs on any platform that Matlab runs on. It
also lets you add sound to the movie. The ReadQTMovie code
reads and parses JPEG-compressed movies. |
Matlab
Source
Code |
Chris
Bregler and I coded an implementation of an image
processing technique known as snakes. There are two m-files
that implement a type of dynamic contour following popular
in computer vision. First proposed by Kass, Witkin and
Terzopoulos in 1987, snakes are a variational technique to
find the best contour that aligns with an image. The basic
routine, snake.m, aligns a sequence of points along a
contour to the maximum of an array or image. Provide it with
an image, a set of starting points, limits on the search
space and it returns a new set of points that better align
with the image. The second m-file is a demonstration script.
Using your own array of image data, or a built-in default, a
demo window is displayed where you can click to indicate
points and see the snake program in action. |
Matlab
Source
Code
Matlab
Demonstration
Source
|
Michele Covell and I wrote some Matlab code
to compute multi-dimensional scaling (MDS). MDS allows you
to reconstruct an estimate of the position of points,
given just relative distance data. These routines handle both
metric (where you know the distances) and non-metric (where
you just know the order of the distances) data. |
Technical
report
containing the code (no documentation). |
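To make the metric case concrete, here is a minimal sketch of classical MDS in plain Python. It is not the Matlab code from the technical report and it does not cover the non-metric case. The function name, the use of power iteration to find eigenvectors, and the iteration count are assumptions made for this example, and the input is assumed to be a symmetric matrix of Euclidean distances.

```python
import math
import random


def classical_mds(distances, dims=2, iterations=500, seed=0):
    """Estimate coordinates from pairwise Euclidean distances (metric case only)."""
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products of
    # mean-centered coordinates (the matrix is assumed symmetric).
    gram = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration finds the dominant remaining eigenvector.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(a * b for a, b in zip(v, gv))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Pairwise distances of a 3-4-5 right triangle.
triangle = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
points = classical_mds(triangle)
```

The recovered points are only determined up to rotation, reflection, and translation, so the useful check is that their pairwise distances match the input.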
Apple Publications
The SoundAndImage toolbox is a collection of Matlab
tools to make it easier to work with sounds and images. On
the Macintosh, tools are provided to record and playback
sounds through the sound system, and to copy images to and
from the scrapbook. For both Macintosh and Unix systems,
routines are provided to read and write many common sound
formats (including AIFF). Only 68k MEX files are included.
Users on other machines will need to recompile the software.
This toolbox is published as Apple Computer Technical Report
#61. |
Postscript
Documentation
(153k)
Adobe
PDF
Documentation (20k)
Macintosh
Archive
|
I created a Hypercard stack to make it
easier for people with a Macintosh and CDROM drive to
interact with the Acoustical Society of America's Auditory
Demonstrations
CD. This CD is a wonderful collection of auditory
effects and principles. The ASA Demo Hypercard stack
includes the text and figures from the book and lets you
browse the Audio CD. |
Macintosh
Archive |
I wrote a program for the Macintosh 660/AV
and 840/AV computers that uses the DSP (AT&T3210) to
monitor audio levels. VUMeters runs on any Macintosh with
the AT&T DSP chip. Source and binaries are included. |
Macintosh
Archive |
Bill Stafford and I wrote TCPPlay to allow us
to play sounds from a Unix machine over the network to the
Macintosh on our desks. This archive includes Macintosh
and Unix source code and the Macintosh application. There
are other network audio solutions, but this works well on
the Macintosh. |
Macintosh
Archive |
Previous Publications
In a past life, I worked on
medical imaging. A book on tomographic imaging
(cross-sectional x-ray imaging) was published by IEEE Press:
Avinash C. Kak and Malcolm Slaney, Principles of
Computerized Tomographic Imaging, (New York : IEEE
Press, c1988). The software used to generate many of the
tomographic images in this book is available. The parallel
beam reconstruction on the left was generated with the
commands
gen n=100 k=100 if=lib.d.s
filt n=100 k=100
back n=100 k=100
disn min=1.0 max=1.05
|
Tomographic
Software
(Unix TAR format)
Tomographic
Software
(Shell archive)
The book is now online. Download the PDF or
order the book from SIAM.
|
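The command names above suggest the usual generate, filter, and backproject stages of parallel-beam filtered backprojection; that reading is an assumption, and the sketch below is not the software from the archive. It uses the standard spatial-domain ramp filter and linear interpolation, with a uniform disk as a stand-in phantom. All function names, the detector spacing, and the sizes are hypothetical.

```python
import math


def ram_lak_kernel(half_width, tau):
    """Spatial-domain ramp filter samples h(n*tau) for n = -half_width..half_width."""
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            kernel.append(1.0 / (4.0 * tau * tau))
        elif n % 2 == 0:
            kernel.append(0.0)
        else:
            kernel.append(-1.0 / (n * n * math.pi * math.pi * tau * tau))
    return kernel


def filter_projection(projection, tau):
    """Convolve one projection with the ramp filter (direct O(n^2) sum)."""
    n = len(projection)
    kernel = ram_lak_kernel(n - 1, tau)  # kernel[m + n - 1] holds h(m * tau)
    return [tau * sum(kernel[(i - k) + (n - 1)] * projection[k] for k in range(n))
            for i in range(n)]


def backproject(filtered, angles, tau, x, y):
    """Reconstruct the density at one point (x, y) from filtered projections."""
    n = len(filtered[0])
    center = (n - 1) / 2.0
    total = 0.0
    for q, theta in zip(filtered, angles):
        t = x * math.cos(theta) + y * math.sin(theta)
        pos = t / tau + center
        lo = int(math.floor(pos))
        if 0 <= lo < n - 1:
            frac = pos - lo
            total += (1.0 - frac) * q[lo] + frac * q[lo + 1]
    return total * math.pi / len(angles)


tau = 0.02      # detector spacing (hypothetical units)
radius = 0.5    # radius of a uniform disk with density 1
num_rays, num_angles = 101, 90
center = (num_rays - 1) / 2.0
# Generate: analytic projection of the disk, identical at every angle.
projection = [2.0 * math.sqrt(max(radius ** 2 - ((k - center) * tau) ** 2, 0.0))
              for k in range(num_rays)]
angles = [i * math.pi / num_angles for i in range(num_angles)]
# Filter: the disk is rotationally symmetric, so one filtered projection serves all angles.
filtered = [filter_projection(projection, tau)] * num_angles
# Backproject: evaluate the reconstruction inside and outside the disk.
inside = backproject(filtered, angles, tau, 0.0, 0.0)
outside = backproject(filtered, angles, tau, 0.8, 0.0)
```

The value at the disk center should come out close to the true density of 1, and the value outside the disk should be much smaller. A practical implementation would filter with an FFT and reconstruct a full image grid rather than individual points.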
Carl Crawford, Mani Azimi and I wrote a simple
Unix plotting package called qplot. Both two-dimensional and
3d-surface plots are supported. |
Compressed
Unix
TAR File |
Now-obsolete code that implements a DITroff
previewer under SunView is available. This program was
called suntroff and is an ancestor of the X Window
System Troff previewer. It was written while I was an
employee of Schlumberger Palo Alto Research. All files are
compressed Unix TAR files. |
Source
LaserWriter
fonts
Complete
package
|
Other Research Pointers
I organize the Stanford CCRMA Hearing Seminar.
Just about any topic related to auditory perception is considered
fair game at the seminar. An archive of seminar announcements can be
found at Stanford
(organized
as a table) or at UCSC as a
chronological listing of email announcements. Send email to [email protected]
if you would like to be added to the mailing list.
For more Information
I can be reached at
Malcolm Slaney
The best way to reach me is to send email.
This page last updated on September 3, 2012.
Malcolm Slaney (
[email protected])