
A word of warning

This is very boring, very time-consuming work. It will take you upwards of 40 hours of work to create a dictionary with enough words to be usable (I speak from experience).

You must select every word by hand, enter XML for every form that word has in English (plurals, tenses, etc.), and then correct all the mistakes you made.

You cannot just grab a word list from a password cracker and automatically import it and expect good (or even bad) results.

How to make your own dictionary

(Note: there is now an alternate way to make a custom dictionary using a Pluggable Dictionary Loader)

OK, if I haven't scared you away yet, here's how the dictionary works.

It's a pretty simple XML file which lists words by part of speech (noun, verb, etc.), each with its different forms (plural and tense). The file can be plain UTF-8 XML or UTF-8 XML compressed with gzip. By default, the generator looks for a dictionary.xml (or .gz) in the current working directory or next to the entry assembly.
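
To check that a file you have written is readable in both the plain and the gzip-compressed form, a small script can help. This is a minimal Python sketch, not the generator's own loader (which is .NET code): the function name load_words is hypothetical, and it assumes a single root element wraps all the word elements.

```python
import gzip
import xml.etree.ElementTree as ET


def load_words(path):
    """Parse a plain or gzip-compressed dictionary file and return its word elements."""
    # Detect gzip by its two magic bytes rather than trusting the file extension.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        root = ET.parse(f).getroot()
    # Assumption: one root element wraps every word element.
    return list(root)
```

Calling load_words on your file raises an xml.etree.ElementTree.ParseError if the XML is malformed, which catches the most common hand-editing mistakes early.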

Each part of speech is a single XML element and is represented as a Word object in code. The main ones you'll be interested in are nouns, adjectives, verbs and adverbs. Each of these is stored in a separate file in the code base and joined back together at compile time (it's easier to append words that way).

The best way to understand the dictionary is to see it.

  <article definite="the" indefinite="a" indefiniteBeforeVowel="an" />
  <demonstrative singular="this" plural="these" />
  <demonstrative singular="that" plural="those" />
  <personalPronoun singular="my" plural="our" />
  <personalPronoun singular="your" plural="your" />
  <preposition value="above"/>
  <preposition value="across from"/>
  <noun singular="waterway" plural="waterways"/>
  <noun plural="grog"/>
  <noun plural="pliers"/>
  <adjective value="downsized"/>
  <adverb value="ominously"/>
  <verb presentSingular="concocts" 
        pastSingular="concocted" 
        pastContinuousSingular="was concocting" 
        futureSingular="will concoct" 
        continuousSingular="is concocting" 
        perfectSingular="has concocted" 
        subjunctiveSingular="might concoct"
        presentPlural="concoct" 
        pastPlural="concocted" 
        pastContinuousPlural="were concocting" 
        futurePlural="will concoct" 
        continuousPlural="are concocting" 
        perfectPlural="have concocted" 
        subjunctivePlural="might concoct"/>
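
Typing all 14 verb attributes by hand is where most mistakes creep in. The following Python sketch builds a verb element from five hand-supplied forms, copying the pattern of the "concoct" entry above. The helper name verb_element and its parameters are hypothetical, and the auxiliary words (was, will, is, has, might and so on) are simply taken from that one sample, so irregular cases still need checking by eye.

```python
import xml.etree.ElementTree as ET


def verb_element(base, third_person, past, participle, past_participle):
    """Build a <verb> element with all 14 forms, following the 'concoct' sample."""
    forms = {
        "presentSingular": third_person,
        "pastSingular": past,
        "pastContinuousSingular": f"was {participle}",
        "futureSingular": f"will {base}",
        "continuousSingular": f"is {participle}",
        "perfectSingular": f"has {past_participle}",
        "subjunctiveSingular": f"might {base}",
        "presentPlural": base,
        "pastPlural": past,
        "pastContinuousPlural": f"were {participle}",
        "futurePlural": f"will {base}",
        "continuousPlural": f"are {participle}",
        "perfectPlural": f"have {past_participle}",
        "subjunctivePlural": f"might {base}",
    }
    return ET.Element("verb", forms)


# Rebuild the sample entry and print it as a single line of XML.
element = verb_element("concoct", "concocts", "concocted", "concocting", "concocted")
print(ET.tostring(element, encoding="unicode"))
```

An irregular verb works the same way, for example verb_element("write", "writes", "wrote", "writing", "written"), because every form that cannot be derived is passed in explicitly.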


As you can see, pronouns, demonstratives and nouns come in singular and plural forms. Nouns can have either form or both. Adjectives and adverbs have only a single form. And verbs are just horridly complicated (up to 7 tense forms, each in singular and plural). Verbs should have at least present, past and future tenses, but the more forms you supply, the more combinations the generator can produce.
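
The rules in the paragraph above can be turned into a rough sanity check for hand-written entries. This is a Python sketch under stated assumptions, not the generator's own validation: check_entry and REQUIRED_VERB_FORMS are hypothetical names, and it assumes that "present, past and future" means both the singular and the plural attribute of each tense.

```python
import xml.etree.ElementTree as ET

# Assumption: the minimum for a verb is these six attributes.
REQUIRED_VERB_FORMS = (
    "presentSingular", "pastSingular", "futureSingular",
    "presentPlural", "pastPlural", "futurePlural",
)


def check_entry(element):
    """Return a list of problems found in one dictionary element (empty if none)."""
    problems = []
    if element.tag == "verb":
        missing = [name for name in REQUIRED_VERB_FORMS if not element.get(name)]
        if missing:
            problems.append("verb is missing: " + ", ".join(missing))
    elif element.tag == "noun":
        # A noun may have a singular form, a plural form, or both.
        if not (element.get("singular") or element.get("plural")):
            problems.append("noun needs a singular or plural form")
    elif element.tag in ("adjective", "adverb"):
        if not element.get("value"):
            problems.append(element.tag + " needs a value")
    return problems


# The wrapping <words> element exists only so the fragment parses.
sample = ET.fromstring(
    "<words>"
    '<noun plural="pliers"/>'
    '<adjective value="downsized"/>'
    '<verb presentSingular="concocts" pastSingular="concocted"/>'
    "</words>"
)
for entry in sample:
    for problem in check_entry(entry):
        print(problem)
```

Here only the incomplete verb is reported; the plural-only noun and the adjective pass.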

That's all there is to it!

Now you just need 5000 of them.

Last edited Jul 28, 2012 at 2:22 PM by ligos, version 3

Comments

VWFeature Sep 15, 2014 at 10:42 PM 
Folks trying to generate dictionaries might be able to use this POS tagger to get a start.
http://www.nltk.org/book/ch05.html

http://www.nltk.org/howto/generate.html

There's also this:
http://en.wikipedia.org/wiki/Part-of-speech_tagging
which references SEVERAL powerful tools like http://yatsko.zohosites.com/cll-tagger.html
http://yatsko.zohosites.com/contact.html
"Part-of speech tagging has been widely used in corpus linguistics and in the last decades has become an indispensable component in such fields as text mining and text classification/categorization. Application of tagging in these fields faces one major problem: it is a time consuming procedure...(but this program does it REALLY fast!)" WOW!!

It looks like these tools can take a large body of text (like something from Project Gutenberg) and generate a dictionary/corpus separated into POS to use in generating passphrases.

There have been lots of very smart people working on this for a very long time! I don't pretend to understand all this, but I hope this info could help someone who does. I just make connections... Have fun!!

http://www.grammaticalframework.org/doc/gf-quickstart.html
http://www.cis.upenn.edu/~xtag/
http://attempto.ifi.uzh.ch/site/description/

PerplexterKot Feb 23, 2014 at 2:49 AM 
(2 years later)
I'd like to create a dictionary for the German language, but it seems that the XML tags (or better: the algorithm in the source code) will limit its application to German grammar (or even make it impossible).
For instance, German has 3 articles (masculine/feminine/neuter), which I can't see how to put in the XML.

This would be a better project for a skilled programmer, which, unfortunately, I'm not. So I could only spend time on creating a dictionary.

If there is anybody out there who can and would help to create such a nice plugin for German KeePass users, just send a message.

Anyways, regards and respect to ligos! Thank you.

n1LWEb Oct 26, 2012 at 9:48 AM 
Did anyone take the time to create a German dictionary?

RBNK Oct 8, 2012 at 4:29 AM 
Don't see any sense in making my own dictionary when I can just add to yours.