How we found a million style and grammar errors in the English Wikipedia – Daniel Naber

LanguageTool is a Java tool that analyses (XML) text for spelling and grammatical errors. It was run on 20K Wikipedia articles and reported 37K errors (grammar and style, not simple spelling mistakes). Projected onto the whole of Wikipedia (4.4M articles), that gives about 8M potential errors. Many of these are false alarms, though, caused by MediaWiki syntax, non-English names, and symbols used in text. For example, in “The value of n for a given a is called”, the variable ‘a’ triggers an error; “68000 assembler” is flagged with the suggestion “assemblers”; and “Score voting and Majority Judgement allow the voters…” is flagged with the suggestion “allows”. Some things are not detected at all: “Tomorrow, I go shopping”; “I made a concerted effort”.
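The projection is a simple linear extrapolation from the sample to the full article count. A minimal sketch of that arithmetic, using the figures quoted above (the function name is only illustrative, and the result still includes the false alarms):

```python
def project_errors(sample_errors, sample_articles, total_articles):
    """Linearly extrapolate an error count from a sample to the whole corpus."""
    errors_per_article = sample_errors / sample_articles
    return errors_per_article * total_articles


# Figures from the talk: 37K errors in 20K articles, 4.4M articles in total.
projected = project_errors(37_000, 20_000, 4_400_000)
print(f"about {projected / 1e6:.0f}M potential errors")
```

This assumes the 20K sampled articles are representative of the whole of Wikipedia.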

The goal of LanguageTool is to be the next step after spell checking. It is available on the web and as an extension in LibreOffice and Firefox.

Error detection is based on patterns, specified in XML, so contributing does not require programming. Patterns support regexes (always matching one word) but also inflections of words (plurals, conjugations) and word types (verb, adverb, …). A rule also contains a number of examples of correct and incorrect sentences, which makes it possible to understand the rule but also to test it.
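To illustrate the idea of a rule that carries its own examples, here is a toy sketch in Python. It is not LanguageTool's XML format or its matching engine: the `Rule` class, the `matches` and `self_test` functions, and the "could of" rule are all hypothetical, and only the regex-per-word part is modelled (no inflections or word types).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    pattern: list            # one regex per token; each must match a whole word
    message: str
    correct: list = field(default_factory=list)    # sentences that must NOT match
    incorrect: list = field(default_factory=list)  # sentences that MUST match


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def matches(rule, text):
    """Return True if the rule's token pattern occurs anywhere in the text."""
    tokens = tokenize(text)
    n = len(rule.pattern)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(re.fullmatch(p, t) for p, t in zip(rule.pattern, window)):
            return True
    return False


def self_test(rule):
    """Check the rule against its own examples, so the examples double as tests."""
    return (all(matches(rule, s) for s in rule.incorrect)
            and not any(matches(rule, s) for s in rule.correct))


# Hypothetical rule: "could of" / "should of" / "would of" instead of "... have".
modal_of = Rule(
    rule_id="MODAL_OF",
    pattern=[r"could|should|would", r"of"],
    message='Did you mean "have" instead of "of"?',
    correct=["I could have gone."],
    incorrect=["I could of gone."],
)

print(self_test(modal_of))
```

Because the examples live inside the rule, a contributor can see at a glance what the rule is meant to catch, and a test run can reject a rule whose pattern no longer agrees with its examples.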

Currently there is support for 29 languages. The languages with the most patterns are French, German, and Catalan; English is only in 5th place, with one third as many rules as French.

Is pattern matching enough? Isn’t something more powerful needed for natural language? Probably yes, but it would be very specific to a single language. Why not use machine learning? You’d need a large corpus of errors and correct sentences to train it. But it could actually be added to LanguageTool: just write your own rule in Java. OpenNLP is already used for chunking.

To help fix the Wikipedia errors, there is a website where you can mark errors as false alarms or go directly to the edit page to fix them (LanguageTool will preload it with the corrected text). There is also a check on RecentChanges (only on the changes, not the whole article), so that at least things improve rather than become worse.
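Checking only what an edit changed, rather than the whole article, can be sketched with a line diff. This is only an illustration of the idea, not how the LanguageTool service actually does it: `added_lines`, `check_edit`, and the toy `flag_could_of` checker are hypothetical names.

```python
import difflib


def added_lines(old_text, new_text):
    """Return the lines that are new or changed in new_text compared to old_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def check_edit(old_text, new_text, checker):
    """Run checker only on added lines; return (line, findings) for flagged lines."""
    results = []
    for line in added_lines(old_text, new_text):
        findings = checker(line)
        if findings:
            results.append((line, findings))
    return results


def flag_could_of(line):
    """Toy stand-in for a real grammar checker: flags the phrase 'could of'."""
    return ["could of -> could have"] if "could of" in line.lower() else []


old = "Line one.\nLine two."
new = "Line one.\nLine two.\nI could of gone."
print(check_edit(old, new, flag_could_of))
```

Errors that were already in the article before the edit are ignored here, which matches the goal of keeping new changes from making things worse.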

LanguageTool is in good shape to extend the ubiquitous spell checker into a style and grammar checker. It’s stable and has reasonable support for many languages. However, it’s written in Java, which makes it difficult to integrate into other applications. Idea: compile it to JavaScript with LLVM, but that currently doesn’t work…

How to help: add support for other languages, or become a maintainer for one of the existing languages.

