About
Jason is presently serving as the Co-Founder and Chief Technology Officer at Xembly…
Articles by Jason
Activity
-
“You really have to ground your AI systems for enterprise use cases, imagine a nurse in a hospital system using AI to make some decision about…
Liked by Jason Flaks
-
I’ve come to realize that #LLMs are the MP3’s of machine learning models for discriminative #NLP tasks. They aren’t as good, only experts can tell…
Posted by Jason Flaks
-
Today marks my first day at my new job with Sonos, Inc. here in Seattle! I'll be working as a senior SDET on the engineering team designing test…
Liked by Jason Flaks
Publications
-
Audio Spectrogram Factorization for Classification of Telephony Signals below the Auditory Threshold
arxiv.org
Traffic Pumping attacks are a form of high-volume SPAM that target telephone networks, defraud customers and squander telephony resources. One type of call in these attacks is characterized by very low-amplitude signal levels, notably below the auditory threshold. We propose a technique to classify so-called "dead air" or "silent" SPAM calls based on features derived from factorizing the caller audio spectrogram. We describe the algorithms for feature extraction and classification as well as our data collection methods and production performance on millions of calls per week.
-
The Marchex 2018 English Conversational Telephone Speech Recognition System
arxiv.org
In this paper, we describe recent performance improvements to the production Marchex speech recognition system for our spontaneous customer-to-business telephone conversations. In our previous work, we focused on in-domain language and acoustic model training. In this work we employ a state-of-the-art semi-supervised lattice-free maximum mutual information (LF-MMI) training process which can supervise over full lattices from unlabeled audio. On Marchex English (ME), a modern evaluation set of conversational North American English, we observed a 3.3% (3.2% for agent, 3.6% for caller) reduction in absolute word error rate (WER) with 3x faster decoding speed over the performance of the 2017 production system. We expect this improvement to boost Marchex Call Analytics system performance, especially for the natural language processing pipeline.
-
Semi-Supervised Model Training for Unbounded Conversational Speech Recognition
arXiv.org
For conversational large-vocabulary continuous speech recognition (LVCSR) tasks, up to about two thousand hours of audio is commonly used to train state-of-the-art models. Collection of labeled conversational audio, however, is prohibitively expensive, laborious and error-prone. Furthermore, academic corpora like Fisher English (2004) or Switchboard (1992) are inadequate to train models with sufficient accuracy in the unbounded space of conversational speech. These corpora are also timeworn due to dated acoustic telephony features and the rapid advancement of colloquial vocabulary and idiomatic speech over the last decades. Utilizing the colossal scale of our unlabeled telephony dataset, we propose a technique to construct a modern, high quality conversational speech training corpus on the order of hundreds of millions of utterances (or tens of thousands of hours) for both acoustic and language model training. We describe the data collection, selection and training, evaluating the results of our updated speech recognition system on a test corpus of 7K manually transcribed utterances. We show relative word error rate (WER) reductions of {35%, 19%} on {agent, caller} utterances over our seed model and 5% absolute WER improvements over IBM Watson STT on this conversational speech task.
-
The New Wave of Robocallers Costing Businesses Billions
Marchex
Marchex examined more than 300 million calls placed to businesses in 2014 through its Marchex Call Analytics platform and found that call centers across the U.S. receive more than 100 million spam calls a year. That's a cost of about $1 billion. Is your business protected?
-
Quality of Service (QoS) for Streaming Audio Over Wireless LANs
Audio Engineering Society
Streaming audio in computer networks requires a level of quality of service above and beyond the "best-effort" service that is typically provided. The need for enhanced QoS is even greater in wireless networks where issues of interference, security, roaming, and bandwidth constraints are added. This paper discusses QoS issues important for providing high quality streaming audio over wireless networks. In addition, an overview of future QoS enhancements in IEEE 802.11e and HomeRF 2.0 is provided.
-
Pseudo-Continuous Source Independent Load Monitoring and Applications Through Loudspeaker Impedance Analysis. Flaks, Jason
Master's Research Project at the University of Miami
Loudspeaker measurements in public facilities can often be difficult to perform due to the intrusive nature of the test signals. This paper proposes a pseudo-continuous source-independent technique for measuring loudspeaker impedance, which allows for the use of program music as a test signal. Using Fourier analysis to obtain frequency domain data of both voltage and current, an accurate plot of impedance versus frequency can be made. Averaging, smoothing, and signal thresholding are used to further increase amplitude accuracy. The impedance plot compared to a preexisting reference can alert the user to possible loudspeaker problems that can arise over time. The validity of making such measurements is given by loudspeaker impedance analysis using a derived impedance equation from a loudspeaker equivalent circuit and experimental techniques.
-
Speech de-esser using adaptive filters
Intl. Conf. Signal Proc. & Tech. ICSPAT, 1999
-
Global Musical Instrument Communication Standard (GMICS(tm)): An Integrated Digital Audio and Control Communication Specification for Instruments
Audio Engineering Society Preprints
This paper provides an in-depth look at the Global Musical Instrument Communication Standard (GMICS) from the electrical, physical, data link, and control perspective. Using the 100-megabit Ethernet physical layer, and newly defined data link and control protocols, GMICS provides a low latency digital audio and control highway appropriate for instruments, especially in live performance situations where delay is intolerable.
-
RTP Payload Format for AC-3 Audio - RFC 4184
IETF
This document describes an RTP payload format for transporting audio data using the AC-3 audio compression standard. AC-3 is a high quality, multichannel audio coding system that is used for United States HDTV, DVD, cable television, satellite television and other media. The RTP payload format presented in this document includes support for data fragmentation.
Patents
-
Fusing virtual content into real content
Issued United States 8,884,984
A system that includes a head mounted display device and a processing unit connected to the head mounted display device is used to fuse virtual content into real content. In one embodiment, the processing unit is in communication with a hub computing device. The system creates a volumetric model of a space, segments the model into objects, identifies one or more of the objects including a first object, and displays a virtual image over the first object on a display (of the head mounted display) that allows actual direct viewing of at least a portion of the space through the display.
-
Sound Source Separation Using Spatial Filtering and Regularization Phases
Issued United States 8,583,428
Described is a multiple phase process/system that combines spatial filtering with regularization to separate sound from different sources such as the speech of two different speakers. In a first phase, frequency domain signals corresponding to the sensed sounds are processed into separated spatially filtered signals including by inputting the signals into a plurality of beamformers (which may include nullformers) followed by nonlinear spatial filters. In a regularization phase, the separated spatially filtered signals are input into an independent component analysis mechanism that is configured with multi-tap filters, followed by secondary nonlinear spatial filters. Separated audio signals are then provided via an inverse-transform.
-
COMPOUND GESTURE-SPEECH COMMANDS
Issued United States 8,296,151
A multimedia entertainment system combines both gestures and voice commands to provide an enhanced control scheme. A user's body position or motion may be recognized as a gesture, and may be used to provide context to recognize user generated sounds, such as speech input. Likewise, speech input may be recognized as a voice command, and may be used to provide context to recognize a body position or motion as a gesture. Weights may be assigned to the inputs to facilitate processing. When a gesture is recognized, a limited set of voice commands associated with the recognized gesture are loaded for use. Further, additional sets of voice commands may be structured in a hierarchical manner such that speaking a voice command from one set of voice commands leads to the system loading a next set of voice commands.
-
Strongly typed tags
Issued United States 8,041,738
In one or more embodiments, a tag is provided and includes a property that associates a strongly typed variable with the tag. Strongly typed variables can include any suitable types. For example, in at least some embodiments, the strongly typed variable is a people type that allows the tag to be associated with an individual person or group of people by virtue of a unique identification that is associated with the person or group. Strongly typed tags can then serve as a foundation upon which various other types of information and services can be provided to enhance the user experience.
-
ADAPTIVE AMBIENT SOUND SUPPRESSION AND SPEECH TRACKING
Issued United States 8,219,394
A device for suppressing ambient sounds from speech received by a microphone array is provided. One embodiment of the device comprises a microphone array, a processor, an analog-to-digital converter, and memory comprising instructions stored therein that are executable by the processor. The instructions stored in the memory are configured to receive a plurality of digital sound signals, each digital sound signal based on an analog sound signal originating at the microphone array, receive a multi-channel speaker signal, generate a monophonic approximation signal of the multi-channel speaker signal, apply a linear acoustic echo canceller to suppress a first ambient sound portion of each digital sound signal, generate a combined directionally-adaptive sound signal from a combination of each digital sound signal by a combination of time-invariant and adaptive beamforming techniques, and apply one or more nonlinear noise suppression techniques to suppress a second ambient sound portion of the combined directionally-adaptive sound signal.
-
Integrating security by obscurity with access control lists
Issued United States 7,984,512
Aspects of the subject matter described herein relate to providing and restricting access to content. In aspects, information (e.g., a URL) that identifies content and a user is provided to a user. In conjunction with providing the information to a user, a data structure (e.g., an access control list) is updated to indicate that the user has access to the content. The user may use the information to access the content and/or may send this information to other users. The other users may use the information (e.g., by pasting it into a browser) to access the content and may be added to the data structure so that they may subsequently access the content without the use of the information. Access to the content via using the information may be subsequently revoked.
-
Speaker Identification
Issued United States 8,719,019
Speaker identification techniques are described. In one or more implementations, sample data is received at a computing device of one or more user utterances captured using a microphone. The sample data is processed by the computing device to identify a speaker of the one or more user utterances. The processing involves use of a feature set that includes features obtained using a filterbank having filters that space linearly at higher frequencies and logarithmically at lower frequencies, respectively, features that model the speaker's vocal tract transfer function, and features that indicate a vibration rate of vocal folds of the speaker of the sample data.
-
Strongly typed tags
Issued United States 7,912,860
In one or more embodiments, a tag is provided and includes a property that associates a strongly typed variable with the tag. Strongly typed variables can include any suitable types. For example, in at least some embodiments, the strongly typed variable is a people type that allows the tag to be associated with an individual person or group of people by virtue of a unique identification that is associated with the person or group. Strongly typed tags can then serve as a foundation upon which various other types of information and services can be provided to enhance the user experience.
-
Remotely accessing protected files via streaming
Issued United States 7,681,238
A source device permits a user of a remote device to access a protected file on the source device when the user of the remote device has a right to access the protected file. The user locates the protected file on the source device using the remote device and accesses the protected file using a media player on the remote device. The media player constructs a path by which the source device streams the protected file. The remote device responds to an authentication request from the source device that the user of the remote device has a right to access the protected file. The user is authenticated to confirm that the user of the remote device has a right to access the protected file. The protected file is streamed to the remote device via a path constructed by the remote device.
-
Routing of resource information in a network
Issued United States 7,668,939
A media server in a Universal Plug and Play (UPnP) network includes a resource sharing service to govern the distribution of resource information regarding resources to rendering devices. In one case, the resource sharing service consults a criterion to determine whether an identified network device is authorized to receive resource information. In another case, the resource sharing service consults another criterion to determine whether a specified individual associated with the media server must consent to the transfer of the resource information in order for the transfer to occur. The resource information may include resource metadata that describes high level information regarding resources, as well as resource content. The media server includes various user interface presentations that allow the media server user to specify shared resources and distribution criteria.
-
Speech recognition analysis via identification information
Issued United States 8,676,581
Embodiments are disclosed that relate to the use of identity information to help avoid the occurrence of false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person in an image. The acoustic locational data is compared to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor, and the confidence data is adjusted depending on this determination.
-
Techniques for limiting network access
Issued United States 7,647,385
A network architecture in a Universal Plug and Play (UPnP) network includes a resource sharing service to govern the distribution of resource information from a server to a recipient entity (such as a rendering device or a control point). The network architecture includes one or more of the following provisions: (a) setting the server to operate in a predetermined private address range or an Auto IP range; (b) operating one or more parts of the network architecture on the same subnet; (c) setting a time to live (TTL) parameter associated with messages transmitted by the server to a predetermined number; (d) setting a number of permitted recipient entities to a predetermined number; (e) setting a number of permitted concurrent content distribution sessions to a predetermined session number; (f) granting access to a recipient entity on condition that the recipient entity has generated a message that conforms to the UPnP protocol; and (g) retiring a URL used to identify a location of a resource provided by the server after a predetermined amount of time.
-
Server architecture for network resource information routing
Issued United States 7,555,543
A media server in a Universal Plug and Play (UPnP) network includes a resource sharing service to govern the distribution of media resource information to rendering devices. The media server includes: a media service module operating in a clamped down user context (e.g., a local service user context) and configured to share resource information over the network; a supplemental module operating in a local system user context and configured to assist the media service module in sharing resource information over the network; and a control panel module operating in a logged on user context and configured to interact with a user via a user interface display. The local system user context provides a higher level of access to media server resources compared to the clamped down user context. The media server also provides fast user switching (FUS) functionality that allows multiple users to have respective instances of the control panel module pending at the same time. Further, the media server includes a mechanism to prevent rogue applications from masquerading as the control panel module and thereby gaining unauthorized access to the media service module.
-
Universal digital media communications and control system and method
Issued United States 7,420,112
A digital media communications and control system includes a plurality of audio devices each of which includes a device interface module for communication of digital media data and control data from at least one of the devices to at least one other of the devices. A universal data link is operatively connected to each of the device interface modules. The device interface modules and universal data links are operative in combination to connect the devices together in the system and provide full duplex communication of the digital media data and control data between the devices.
-
Universal digital media communications and control system and method
Issued United States 6,686,530
A digital media communications and control system includes a plurality of audio devices each of which includes a device interface module for communication of digital media data and control data from at least one of the devices to at least one other of the devices. A universal data link is operatively connected to each of the device interface modules. The device interface modules and universal data links are operative in combination to connect the devices together in the system and provide full duplex communication of the digital media data and control data between the devices.
-
Apparatus and method for De-esser using adaptive filtering algorithms
Issued United States 6,373,953
A method and apparatus for the real-time creation of an output audio signal from an input signal with an unwanted or noise portion. The system detects the unwanted portion of the input signal by utilizing an adaptive detection filter and reduces the unwanted portion of the input signal. The reduction of the unwanted portion is performed by compression of the unwanted signal, subtraction of the unwanted portion of the signal, or eliminating the output signal until the unwanted portion is no longer detected. The system is specifically designed to find a high frequency and high amplitude sound such as a sibilant.
-
Universal audio communications and control system and method
Issued United States 6,353,169
An audio communications and control system includes a plurality of audio devices each of which includes a device interface module for communication of digital audio data and control data from at least one of the devices to at least one other of the devices. A universal data link is operatively connected to each of the device interface modules. The device interface modules and universal data links are operative in combination to connect the devices together in the system and provide full duplex communication of the digital audio data and control data between the devices.
-
SYSTEM AND METHOD FOR ANALYZING AND CLASSIFYING CALLS WITHOUT TRANSCRIPTION VIA KEYWORD SPOTTING
Filed United States
A facility and method for analyzing and classifying calls without transcription via keyword spotting is disclosed. The facility uses a group of calls having known outcomes to generate one or more domain- or entity-specific grammars containing keywords and related information that are indicative of particular outcome. The facility monitors telephone calls by determining the domain or entity associated with the call, loading the appropriate grammar or grammars associated with the determined domain or entity, and tracking keywords contained in the loaded grammar or grammars that are spoken during the monitored call, along with additional information. The facility performs a statistical analysis on the tracked keywords and additional information to determine a classification for the monitored telephone call.
-
SYSTEM AND METHOD FOR ANALYZING AND CLASSIFYING CALLS WITHOUT TRANSCRIPTION
Filed United States
A facility and method for analyzing and classifying calls without transcription. The facility analyzes individual frames of an audio to identify speech and measure the amount of time spent in speech for each channel (e.g., caller channel, agent channel). Additional telephony metrics such as R-factor or MOS score and other metadata may be factored in as audio analysis inputs. The facility then analyzes the frames together as a whole and formulates a clustered-frame representation of a conversation to further identify dialogue patterns and characterize call classification. Based on the data in the clustered-frame representation, the facility is able to make estimations of call classification. The correlation of dialogue patterns to call classification may be utilized to develop targeted solutions for call classification issues, target certain advertising channels over others, evaluate advertising placements at scale, score callers, and to identify spammers.
-
SYSTEM AND METHOD FOR HIGH-PRECISION 3-DIMENSIONAL AUDIO FOR AUGMENTED REALITY
Filed United States
Techniques are provided for providing 3D audio, which may be used in augmented reality. A 3D audio signal may be generated based on sensor data collected from the actual room in which the listener is located and the actual position of the listener in the room. The 3D audio signal may include a number of components that are determined based on the collected sensor data and the listener's location. For example, a number of (virtual) sound paths between a virtual sound source and the listener may be determined. The sensor data may be used to estimate materials in the room, such that the effect that those materials would have on sound as it travels along the paths can be determined. In some embodiments, sensor data may be used to collect physical characteristics of the listener such that a suitable HRTF may be determined from a library of HRTFs.
-
SPEECH RECOGNITION ANALYSIS VIA IDENTIFICATION INFORMATION
Filed United States
Embodiments are disclosed that relate to the use of identity information to help avoid the occurrence of false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person in an image. The acoustic locational data is compared to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor, and the confidence data is adjusted depending on this determination.
-
DISTRIBUTED ASYNCHRONOUS LOCALIZATION AND MAPPING FOR AUGMENTED REALITY
Filed United States
A system and method for providing an augmented reality environment in which the environmental mapping process is decoupled from the localization processes performed by one or more mobile devices is described. In some embodiments, an augmented reality system includes a mapping system with independent sensing devices for mapping a particular real-world environment and one or more mobile devices. Each of the one or more mobile devices utilizes a separate asynchronous computing pipeline for localizing the mobile device and rendering virtual objects from a point of view of the mobile device. This distributed approach provides an efficient way for supporting mapping and localization processes for a large number of mobile devices, which are typically constrained by form factor and battery life limitations.
-
SEMI-PRIVATE COMMUNICATION IN OPEN ENVIRONMENTS
Filed United States
A system and method providing semi-private conversation using an area microphone between one local user in a group of local users and a remote user. The local and remote users may be in different physical environments, using devices coupled by a network. A conversational relationship is defined between a local user and a remote user. The local user's voice is isolated from other voices in the environment, and transmitted to the remote user. Directional output technology may be used to direct the local user's utterances to the remote user in the remote environment.
Organizations
-
IEEE
-
Audio Engineering Society (AES)
Full Member
Recommendations received
11 people have recommended Jason
More activity by Jason
Anyone remember this amazing magazine? #audio #audioengineering #sound #soundengineering
Liked by Jason Flaks
Since starting with CA College Corp last year, Learn To Be tutors have hosted 4,448 hours of tutoring for ~270 students across California. Not…
Liked by Jason Flaks
I'll be speaking at Microsoft Build in just 2 weeks! Join me as I showcase Docker's latest innovations to help developers be more productive when…
Liked by Jason Flaks
Cobus Greyling I just wrote a bit about this. While true that LLMs can replace many traditional chatbot components, any one using it at scale will…
Shared by Jason Flaks
AI is transforming almost every industry. Zocks has built an assistant for Financial Advisors that is taking that industry by storm, automating all…
Liked by Jason Flaks