I have a question related to "audio mining", the process of analyzing audio clips for the purpose of searching and other processing. I see references on the web (e.g., this and this) that suggest that the technology exists and is in use, but only at large (and expensive) scale.
Most of these systems allow you to search the audio database with text queries, which requires the database to have undergone speech-to-text analysis. Speech-to-text, particularly for arbitrary (and unfamiliar) speakers, is incredibly difficult, so it's not surprising that these systems are not yet mainstream or cheap.
But I'm interested in a subset of this problem: single-speaker, speech-only search. That is, I record a bunch of audio clips of me speaking, and then I submit a query--in the form of spoken words. So it's only one speaker, and the system never has to interpret the audio as text.
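To make the idea concrete, here is a rough sketch of the kind of matching I have in mind. It is only an illustration under my own assumptions, not a known working system: it compares the spoken query to each clip frame by frame using log-magnitude spectra and subsequence dynamic time warping (DTW), a technique I am guessing would fit here. The function names `frame_features` and `subsequence_dtw`, and the frame sizes (400 samples with a 160-sample hop, i.e. 25 ms / 10 ms at an assumed 16 kHz sample rate), are hypothetical choices of mine.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Cut a mono signal into overlapping windowed frames and
    return one log-magnitude spectrum per frame (assumed feature choice)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def subsequence_dtw(query, clip):
    """Find the best alignment of `query` anywhere inside `clip`.

    Both arguments are (frames, features) arrays. Returns the
    length-normalized alignment cost and the clip frame where the
    best match ends. Lower cost means a closer match."""
    # Euclidean distance between every query frame and every clip frame.
    dist = np.linalg.norm(query[:, None, :] - clip[None, :, :], axis=2)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # the query may start at any clip frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # query advances
                                         acc[i, j - 1],      # clip advances
                                         acc[i - 1, j - 1])  # both advance
    end = int(np.argmin(acc[-1]))
    return acc[-1, end] / n, end
```

Searching would then mean computing `subsequence_dtw(frame_features(query), frame_features(clip))` for every stored clip and ranking clips by cost. No text is ever produced, which is the property I care about; whether raw spectra are robust enough even for a single speaker is exactly the part I am unsure of.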
I know this simpler problem is not trivial, but it seems like it should be much easier than what the commercial systems I've read about are doing. And I can't find any examples of this simpler kind of system.