To test the validity of word counting as a solution to the problem, we identified a set of 50 gene sequences whose function and origin have been characterized experimentally. The sequences come from plants (M. truncatula and G. max), Oomycetes (Phytophthora), Zygomycetes (Glomus versiforme), and bacteria (Sinorhizobium meliloti and Agrobacterium tumefaciens). We chose genes known to play a role in plant-microbe interactions, as well as genes that occur across taxa. Before comparative lexical analysis, we withheld these sequences, along with partial transcripts of the same genes, from the training sets. We then calculated hexamer dissimilarities for each of the three training sets as described in Section 2.3.5.
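
The hold-out procedure can be illustrated with a minimal sketch. It assumes that sequences are stored in a dictionary keyed by identifier and that a training set is summarized by pooled hexamer frequencies. The Euclidean distance used here is only a placeholder and is not the dissimilarity measure defined in Section 2.3.5. All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import sqrt

K = 6  # word length: hexamers


def count_kmers(seq, k=K):
    """Count overlapping k-mers, skipping words with ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def frequencies(counts):
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def withhold(training, held_out_ids):
    """Return a copy of the training set without the test sequences."""
    return {sid: s for sid, s in training.items() if sid not in held_out_ids}


def training_profile(training):
    """Pool hexamer counts over all training sequences."""
    pooled = Counter()
    for seq in training.values():
        pooled.update(count_kmers(seq))
    return frequencies(pooled)


def dissimilarity(seq, profile):
    """Placeholder measure (assumption): Euclidean distance between
    the hexamer frequencies of a sequence and a training profile."""
    freq = frequencies(count_kmers(seq))
    words = set(freq) | set(profile)
    return sqrt(sum((freq.get(w, 0.0) - profile.get(w, 0.0)) ** 2
                    for w in words))


# Toy data (hypothetical): one test gene and its partial transcript
# are removed before the training profile is built.
training = {
    "gene_a": "ACGTACGTACGTTTGA",
    "gene_b": "TTGACCATGGCATGCA",
    "test_gene": "ACGTACGTAC",
    "test_gene_partial": "ACGTACG",
}
kept = withhold(training, {"test_gene", "test_gene_partial"})
profile = training_profile(kept)
score = dissimilarity(training["test_gene"], profile)
```

Under these assumptions, the same score would be computed against the profile of each of the three training sets, after removing the test sequences from each.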