Open Yoda Corpus

The language of one of the Star Wars franchise’s most enigmatic and powerful characters, the tiny green Jedi Master Yoda, has attracted a fair bit of attention from linguists due to its idiosyncrasies. A full review and bibliography of Yoda linguistics is beyond the scope of this blog post, but see for instance here, here and here.

Most people agree on the basic facts: what Yoda really likes to do is take a predicate or a verb phrase and stick it at the start of the clause. However, there is variation, and interesting nuances to be explored. As a service to the Yoda linguistics community I’ve collected all of Yoda’s lines from the movies and made them available here. Do what you want with them! (Of course, I make no claim to own Yoda or any of his utterances – more’s the pity.)

The format is tab-separated, with the line itself in the first column and a code for which movie it’s from in the second column. The sources, in order, are:

  • Empire Strikes Back (ESB): own watching and transcription
  • Return of the Jedi (RJ): own watching and transcription
  • Phantom Menace (PM): this script
  • Attack of the Clones (AC): this script
  • Revenge of the Sith (RS): this script
  • The Last Jedi (LJ): own watching and transcription, plus this link

With the prequel scripts, I’ve made some slight editorial tweaks to fix obvious typos and weird punctuation, but otherwise remained faithful. Yoda doesn’t appear in A New Hope or The Force Awakens.

The corpus itself can be found here.
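
For anyone who wants to get started quickly, here is a minimal Python sketch of reading the two-column format described above. The function name `read_corpus` is my own invention. The three sample rows are well-known Yoda lines typed in for illustration, not copied from the corpus file. The sketch also assumes there is no header row, which may not hold for the actual file.

```python
import csv
import io
from collections import Counter

# Movie codes as listed in the post.
MOVIE_CODES = {"ESB", "RJ", "PM", "AC", "RS", "LJ"}


def read_corpus(handle):
    """Yield (line, movie_code) pairs from a tab-separated file object.

    Assumes two columns and no header row (an assumption, not confirmed).
    """
    # QUOTE_NONE keeps quotation marks inside Yoda's lines as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield row[0].strip(), row[1].strip()


# Illustrative sample only: stands in for the real corpus file.
SAMPLE = (
    "Do. Or do not. There is no try.\tESB\n"
    "Judge me by my size, do you?\tESB\n"
    "When nine hundred years old you reach, look as good you will not.\tRJ\n"
)

lines = list(read_corpus(io.StringIO(SAMPLE)))

# Count lines per movie and flag any code not in the list above.
counts = Counter(code for _, code in lines)
unknown = {code for _, code in lines if code not in MOVIE_CODES}

print(counts)
print(unknown)
```

To use it on the real file, replace `io.StringIO(SAMPLE)` with something like `open(path, encoding="utf-8", newline="")`, where `path` points at wherever you saved the corpus.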

Featured image: Yoda statue in California; photo from Wikimedia Commons, by GPS (CC-BY-SA 2.0).


TROLLing: new open data archive

Linguists at the University of Tromsø have released a new repository for language and linguistic data, which is fully open access.

From the archive’s About page:

The Tromsø Repository of Language and Linguistics (TROLLing) is designed as an archive of linguistic data and statistical code. The archive is open access, which means that all information is available to everyone. All postings are accompanied by searchable metadata that identify the researchers, the languages and linguistic phenomena involved, the statistical methods applied, and scholarly publications based on the data (where relevant).

Linguists worldwide are invited to post datasets and statistical models used in linguistic research. The TROLLing Steering Committee is responsible for the scientific content of the archive, whereas the University Library provides quality and relevance control, in addition to user management. The University Library also oversees the technical and legal structure of TROLLing.

You can visit the archive here. There’s also an amusing promotional video.