Questions and answers
1. Can I program by voice?
1.1. Can I program primarily or entirely by voice? That is, using speech recognition rather than a keyboard or mouse?
Yes. Many people do this including people paralyzed from the neck down. Larry Allen, of pcspeak.com, estimates that there are 100 full-time programmers doing essentially all programming tasks by voice with a further 1,000 to 10,000 programmers doing some programming tasks by voice.
1.2. Who programs entirely or mostly by voice?
Today mostly people with a disability that rules out using a keyboard — e.g., carpal tunnel, wrist tendinitis, and other forms of repetitive strain injury (RSI) — and people that have recovered from such disabilities. Uninjured people are put off by the effort required to get programming by voice working even though it is faster for many tasks.
1.3. What if I just want to type and click less?
Although this FAQ focuses on how to avoid using your hands at all or more than a small amount per day, you can use simplified versions of these techniques to type and click a lot less.
To start, you can dictate emails, documentation, and other English text; browse the web via voice commands; and use voice commands to click the mouse. Clicking the mouse can also be done by using a foot switch or "auto click", where software automatically clicks the mouse if it remains in one place long enough.
1.4. What if my main language isn't English?
Dragon NaturallySpeaking supports Dutch, French, German, Italian, Spanish, and Japanese whereas traditional and simplified Chinese are supported by Windows Speech Recognition. If you speak one of these languages, you can dictate to and control your computer in that language. That said, English is used in most of the relevant forums and third-party documentation, so it helps to understand at least some English.
1.5. I am a disabled college student who cannot type. Can I become a programmer?
Yes. Master dictating and the basic built-in voice commands of your speech recognition program, preferably first. You may need a scribe or other assistance for your initial programming classes. Later, self-paced classes or classes with group projects may be good choices.
If you struggle in your initial classes, you must decide whether you struggle because you cannot use a keyboard or because programming is not for you. Not everyone is cut out to be a programmer — being a successful programmer requires an unusual set of talents, and many bright people have failed at programming.
2. How can programming by voice possibly work?
2.1. Isn't it hard to enter code that contains a lot of punctuation like 'input := open("foo.txt");'?
To handle punctuation, programmers can add special vocabulary words (e.g., "colon equals" for ":=", "left paren" for "(" with no preceding space), use code templates, or write editor code that translates more English-like language into actual code. Even with these methods, it may take somewhat longer to enter code than English. This isn't a big deal, however, because programmers, in practice, don't spend much time entering new code.
2.2. How do programmers enter unpronounceable "words" like strncpy, cosh, or ostreambuf_iterator?
When these words occur frequently, programmers just create new vocabulary words with English spoken names (e.g., "D fun" for defun and "short square root" for sqrt). Unfortunately, this method does not work for rare words because no one can remember many rarely used names. Programmers instead use more complicated methods for these words, which are usually program symbols:
New unpronounceable symbols are usually either spelled (e.g., "spell c o s h") or produced by entering then modifying an existing symbol. Existing symbols can be entered by several methods: Programmers can copy a visible symbol (e.g., insert the first symbol on line 10), or they can repeat a recent symbol by choosing from a numbered list of recently entered symbols. Alternatively, they can specify an existing symbol by any English words that it abbreviates (e.g., "output stream buffer iterator" for ostreambuf_iterator). Finally, programmers can use IDE completion by partially spelling out symbols or choosing completions from a list.
By comparison, symbols made from pronounceable words like camel-case symbols (e.g., RelayClientDisconnected) are easy to enter. A simple voice command can join the component English words or editor code can do so after the words are dictated.
2.3. What are some tasks that programming by voice is bad at?
People find selecting or positioning unlabeled, graphical elements — think creating PowerPoint diagrams or using GUI creators — by voice alone frustratingly slow. They find it easy, however, to select elements that have been labeled (e.g., Mouseless Browsing's labeled HTML links).
When elements cannot be labeled, people often supplement speech recognition with mouse replacements like head or eye trackers. Touchscreens or pen-based tablets may also work for some people if not used too much per day.
2.4. What are some tasks that programming by voice is faster for?
Most people speak much faster than they can type, so anything involving writing down English (e.g., email, documentation, comments) is going to be faster by voice. The average typist types 40 words per minute (WPM) but could dictate at over 150 WPM with some practice — almost 4 times faster.
Routine tasks can also be faster because you can use more shortcuts with voice. Voice allows for more usable shortcuts because people remember named shortcuts far better than keystroke combinations. Thousands of voice shortcuts are common. Many people, for example, make a separate voice shortcut to move to each frequently used directory and source file.
3. Why is it so hard to get programming by voice working?
3.1. How long does it take to start programming by voice?
Peoples' experience varies, but getting a basic system working seems to take a small number of months. For example, Tavis Rudd estimates it took him three months to get productive again. A more complete system can take a year.
3.2. Why is programming by voice hard to learn / build systems for?
First, a specialized language must be created and learned. Editing text (programming code or otherwise) involves many small operations that need concise descriptions to be efficient. Consider by analogy how a waitress communicates with a short order cook: she doesn't use "natural language" like "I need two grilled cheese sandwiches, one with fries"; rather, she says something concise like "two cheese one fry". A sample editing command might be "go 14 leap semicolon erase", meaning go to the start of the visible line whose line number's last two digits are 14, move the cursor forward to the next semicolon, then erase 1 character (namely, the semicolon).
Second, a large number of varied commands needs to be created. Editing and code dictation commands alone are not enough; commands are also needed for browsing code, manipulating files, using version control, compiling, debugging, reviewing code, emailing, and browsing the web.
Third, unless the applications the programmer wants to use are especially keyboard friendly, they will need to be extended to support voice control. Extending can be done by using a extension language (e.g., elisp), by writing a plug-in, or by modifying the source code. Extension is often needed so you can easily verbally pick an item out of a list — the method of "hold down an arrow key until you reach the item you want" is far too slow when done by voice. Better methods include visually labeling items so the programmer can say the desired item's label and picking items by saying words they contain. The later method may require extracting the names of the items from the application at runtime if the set of items varies.
3.3. Why are there no ready-made systems for programming by voice?
We believe no company has attempted to build one of these because the market is too small: There are many fewer disabled programmers than lawyers and doctors who could benefit from speech recognition. Moreover, programmers use hundreds, if not thousands, of incompatible languages, editors, and IDEs. This means any one programming-by-voice system can capture only a fraction of the disabled programmer market. Even the larger market for helping the disabled in general control their computers by voice attracts little commercial attention. These market realities are why speech recognition vendors emphasize ease of learning and natural language commands (e.g., "move down 2 lines") rather than ease of use (e.g., "down 2").
Open source, by contrast, is mostly driven by programmers with an itch—in this case, disabled programmers trying to program by voice. Unfortunately, disabled programmers are usually too busy trying to do their day jobs and get/keep a programming-by-voice system working for themselves to have much time to contribute to open source. They do not have time to convert a system that only works for them to a portable, reusable system. Because of these constraints and the fragmentation of programming environments, most of the open source to date has been around tools for building voice commands.
An additional complication is that programming-by-voice systems are most effective when they match the cognitive style and naming conventions of their users (e.g., it's harder to remember command names chosen by other people); this further fragments the community's efforts.
There is one quasi-exception to no open source system being available, namely VoiceCode. VoiceCode is an open source system for Emacs that supports many popular programming languages. Unfortunately, it appears to be living-dead software, slowly rotting, and without any maintainers. As of July 2017, it does not work with the 2012+ versions of the associated speech recognizer.
3.4. Are some languages/editors/IDEs/domains easier than others to program by voice with?
Yes. For languages, ones like Java and Ruby that by convention use names made out of English words are easier than languages like C++ that encourage unpronounceable names. For editors and IDEs, ease of extensibility makes things easier. For domains, command-line-only programs are easier to create by voice than GUIs or web applications because graphic elements need not be manipulated.
Gnuemacs stands out as especially easy to use for programming by voice: it provides excellent extensibility via elisp as well as packages for manipulating files, using version control, compiling source code, and the like. Moreover, all of this functionality can be used without a mouse.
4. How can I get started programming by voice?
4.1. What's the best speech recognizer for programming by voice?
There are only two recognizers usable today for serious programming by voice. Best is Dragon NaturallySpeaking (Dragon) — recently renamed to Dragon Professional Individual — by Nuance. Second best is Windows Speech Recognition (WSR), which is included in Microsoft's recent operating systems. Although WSR is free unlike Dragon, more than one voice coder has given up on it due to its reduced accuracy, particularly with noise words.
Both of these recognizers run under only Windows. They may, however, be used under Linux or Mac OS by running them in a virtual machine. They can also be used with Linux or Macs by using a Windows PC as a "smart terminal": the programmer sits at the Windows PC and SSH's into one or more Linux/Mac OS boxes and pops up xterms, Emacs, and the like via X11 on the Windows PC. These approaches forgo some functionality: Dragon and WSR provide built-in commands for controlling applications written using the standard Windows graphical toolkit and text controls; Dragon also provides natural language commands for Microsoft Office and Internet Explorer. None of this functionality is available for other operating systems.
DragonDictate for the Mac, also made by Nuance, is good for dictation but its command and control functionality is way behind that of Dragon. Siri and similar programs likewise do well at dictation but poorly at commands.
Dragon comes in several versions; you want Dragon Professional Individual 14 or the slightly older Dragon NaturallySpeaking premium 13. The most recent version, DPI 15, should be avoided as it has a bug that makes it unusable with many of the best programming-by-voice tools (including NatLink, Vocola, and DragonFly). There doesn't appear to be a compelling reason to choose the more expensive DPI 14 over DNS 13. The older DNS 12 premium/professional actually works better for programming by voice, but is no longer available.
4.2. What's the best hardware for programming by voice?
Although Dragon NaturallySpeaking works okay with most recent PCs, it works best with a fast processor (e.g., Intel core i7, 3+ GHz) and lots of RAM (8 GB or more). If you will be working in a noisy area, you will want to invest in a high quality noise-canceling microphone. There are many styles of microphones with different people preferring different styles, so we'll just refer you here to some vendor guides:
4.3. What's the best software for creating voice commands?
Vocola and/or DragonFly are your best choice. Both are open source and work with both Dragon and Windows Speech Recognition (WSR).
Vocola provides a concise language for writing voice commands; here are four example commands:
Copy That = {Ctrl+c};
Copy to WordPad = {Ctrl+a}{Ctrl+c} AppBringUp(WordPad);
1..40 (Left | Right | Up | Down) = {$2_$1};
Sort by (Date=e | Sender=n | Subject=s) = {Alt+v}o $1;
Vocola supports user-defined functions, optional command parts, arbitrary text parts, multiple commands in a single utterance, and easy-to-write extensions written in Python (Dragon version) or .Net (WSR version).
DragonFly, by contrast, is a Python programming framework for creating voice commands. Dragonfly's commands are less concise than Vocola's but can do pretty much anything possible with the speech recognition engines. Vocola's standard functionality probably suffices for 95-99% of the voice commands a programmer wants, so we recommend starting with Vocola and only later using DragonFly or custom Vocola extensions for the commands that need more power.
4.4. Are there any demos I can watch?
Here are some demos that you may find inspiring. The demonstrated systems vary in maturity and effectiveness. None shows all the most effective known techniques. For example, Tavis Rudd doesn't demonstrate the use of line numbers and pauses more than necessary. Also, some of the speakers are intentionally speaking slowly to make their demos easier to follow or are using now-decade-old hardware and software, which cannot keep up in real time.
An excellent 2013 demo given by Tavis Rudd at PyCon, entitled "Using Python to Code by Voice". Tavis says that in practice he uses line numbers heavily and goes much faster, in part by using shortcuts he didn't show for fear of angering the demo gods.
A follow-up demo on VimSpeak by Ashley Feniello; note that he does not appear to have speech recognition set up optimally, and thus has a much higher misrecognition rate than expected.
The original VoiceCode demo (the audio is missing in the beginning)
Nils Klarlund's demo of his ShortTalk system, which demonstrates power of stringing together many small editing commands
Jason Veldicott's demo, which uses approximate string matching for dictating existing symbols
4.5. Where can I ask questions?
For general speech recognition questions (e.g., "What's a good microphone?" or "How do I select cells in Excel?"), the best current forums are at KnowBrainer. The Dragon NaturallySpeaking Speech Recognition forum is particularly good. Similar forums in German can be found here.
For programming-specific questions, the voice-coders mailing list is probably your best bet.
4.6. Are there any other helpful resources for building a programming by voice system?
Here are some other useful links:
The Vocola website
James' blog on using DragonFly to program by voice. David Conway's series of articles introducing voice programming is also good.
Mark's sample editing commands in Vocola for Win32Pad that demonstrate combining multiple concise commands into a single utterance, using line numbers mod 100, moving by punctuation, and operating on line ranges specified by line number. Mark uses clever tricks to provide this functionality for an underpowered editor; likely, some variant of these tricks can be used with many programming editors.
The NatLink website, the unofficial extension of Dragon that allows creating voice commands using a kludgy Python interface. This extension is used by both Vocola and DragonFly to provide smoother ways of creating commands. A paper, "Implementation and Acceptance of NatLink, a Python-Based Macro System for Dragon NaturallySpeaking", and a PowerPoint, "NatLink: A Python Macro System for Dragon NaturallySpeaking", are available.
Unimacro, a collection of NatLink grammars by the current maintainer of NatLink
The VoiceCode project website such as it is and the academic paper, "VoiceCode: an Innovative Speech Interface for Programming-by-Voice", describing the VoiceCode system
The Mouseless Browsing extension for Firefox, which labels HTML links with numbers and allows them to be selected by typing those numbers; this allows links to be chosen by voice. The new Click by Voice extension provides equivalent functionality for Chrome.
Freesr Speech Recognition and SpeechStart+, third-party utilities that among other things label widgets, windows, and taskbar items and icons for manipulation by provided voice commands. SpeechStart+ also allows restarting Dragon NaturallySpeaking by voice when it gets stuck, which is handy for completely hands-free users.
NaturalPoint's SmartNav 4, a reliable and accurate, hands-free mouse alternative that allows complete control of a computer by naturally moving the head.
AutoHotkey, a free, open-source keyboard-macro-creation and automation program. The AutoHotkey community is a useful source of expertise on automating Windows applications; AutoHotkey macros can be invoked from voice macros using
SendSystemKeys
orSendInput
.
5. Disclaimers and licensing
All contributors' employers will no doubt disown any statements herein. We are not speaking for anyone but ourselves.
Every effort has been made to produce an accurate and useful document, but the information herein is completely without warranty. If you find any errors or otherwise wish to contribute, please contact Mark.
This document may be freely distributed as long as it is not modified. An ASCII version is also available.