Examples

Language detection

Let's write a simple language detector. It will recognize German and English texts, and you can easily add more languages.

We will need the NaiveText gem, so go ahead and install it:

gem install NaiveText

We will also need some example texts in both languages. Create a directory training with two subdirectories, training/german and training/english, and put some training material into them. Ordinary texts will normally do, but I decided to use a list of commonly used words for each language. You can find it together with the code here.
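If you prefer to create the directory layout from Ruby, a minimal sketch could look like the following. The helper name create_training_dirs, the file name words.txt and the sample words are assumptions made for illustration; they are not part of NaiveText or of the linked example code.

```ruby
require 'fileutils'

# Hypothetical helper: writes one word list per language to
# <base_dir>/training/<language>/words.txt and returns the file paths.
def create_training_dirs(base_dir, samples)
  samples.map do |language, words|
    dir = File.join(base_dir, 'training', language)
    FileUtils.mkdir_p(dir)
    path = File.join(dir, 'words.txt')
    File.write(path, words.join("\n"))
    path
  end
end

# Tiny illustrative word lists; real training material should be much larger.
SAMPLES = {
  'german'  => %w[der die das und ist nicht ein ich],
  'english' => %w[the and is not a of to in]
}.freeze
```

Calling create_training_dirs(Dir.pwd, SAMPLES) would create the two directories below the current directory. With the training files in place, the detector class can load them: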
```
require "NaiveText"

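class LanguageDetector
  # Exposes the classifier to the instance methods added later.
  attr_reader :classifier

  def initialize
    # Load the example texts saved in the training directories.
    german_examples = ExamplesFactory.from_files('training/german')
    english_examples = ExamplesFactory.from_files('training/english')

    categories_config = [{ name: 'german', examples: german_examples },
                         { name: 'english', examples: english_examples }]

    # Keep the classifier in an instance variable so it outlives initialize.
    @classifier = NaiveText.build(categories: categories_config)
  end
end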
```
The above code does two things:
1. It loads our examples (saved previously).
2. It builds a classifier.
To build a simple classifier, you only need to specify an array of categories, each consisting of a name and some example texts that orient the classifier.
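To get a feeling for what a classifier can do with such an array of categories, here is a minimal naive Bayes sketch in plain Ruby. It is an illustration only and not NaiveText's actual implementation: the class name TinyNaiveBayes, the tokenizer, the Laplace smoothing and the equal treatment of all categories are assumptions.

```ruby
# Minimal naive Bayes sketch (illustration only, not NaiveText's code).
class TinyNaiveBayes
  # categories: [{ name: 'german', examples: ['der hund', ...] }, ...]
  def initialize(categories)
    # Count how often each word appears in the examples of each category.
    @counts = categories.to_h do |category|
      words = category[:examples].flat_map { |text| tokenize(text) }
      [category[:name], words.tally]
    end
    @vocabulary_size = @counts.values.flat_map(&:keys).uniq.size
  end

  # Returns the name of the category with the highest log score.
  def classify(text)
    @counts.keys.max_by { |name| log_score(name, text) }
  end

  # Sums the log probabilities of the words in text for one category.
  def log_score(name, text)
    counts = @counts[name]
    total = counts.values.sum
    tokenize(text).sum do |word|
      # Laplace smoothing keeps unseen words from zeroing the score.
      Math.log((counts.fetch(word, 0) + 1.0) / (total + @vocabulary_size))
    end
  end

  private

  # Lowercases the text and splits it into runs of letters.
  def tokenize(text)
    text.downcase.scan(/[[:alpha:]]+/)
  end
end

tiny = TinyNaiveBayes.new([
  { name: 'german',  examples: ['der hund ist nicht hier', 'das ist ein haus'] },
  { name: 'english', examples: ['the dog is not here', 'this is a house'] }
])
puts tiny.classify('der hund')
```

Words that were seen more often in one category's examples pull the score toward that category, which is why a plain list of common words already works as training material.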

Next we will write two methods to interface with the classifier:
```
require "NaiveText"

class LanguageDetector
  def initialize
      ... #like above
  end

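  # Returns the name of the most likely language as a string.
  def get_language(args)
    classifier.classify(args[:text]).name
  end

  # Returns the probabilities of all categories for the given text.
  def probabilities(args)
    classifier.probabilities(args[:text])
  end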
end
```
The first method, get_language, takes a text as a keyword argument and returns the corresponding language as specified in our categories array. Note that classifier.classify returns a category, so you need to call name on it to get the language name as a string.

The second method, probabilities, also takes a text as a keyword argument, but it returns a collection of probabilities. You can print it using puts (example below).
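How NaiveText computes these probabilities internally is not shown here. As a rough sketch of the general idea, per-category log scores can be turned into probabilities that sum to 1. The function name probabilities_from_log_scores and the sample scores are made up for illustration.

```ruby
# Turns per-category log scores into probabilities that sum to 1.
# Subtracting the maximum first avoids underflow in Math.exp.
def probabilities_from_log_scores(log_scores)
  max = log_scores.values.max
  weights = log_scores.transform_values { |score| Math.exp(score - max) }
  total = weights.values.sum
  weights.transform_values { |weight| weight / total }
end

# Made-up scores: the text is three times as likely under 'english'.
scores = { 'english' => Math.log(0.03), 'german' => Math.log(0.01) }
probabilities_from_log_scores(scores).each do |name, probability|
  puts format('%-8s %.2f', name, probability)
end
```

With these sample scores, 'english' ends up at roughly 0.75 and 'german' at roughly 0.25.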

Now we can use our language detector:
```
... # Code from above

detector = LanguageDetector.new

puts detector.get_language(text: 'This is obviously an english text') # ==> english
puts detector.get_language(text: 'Der Text ist deutsch')              # ==> german

multi_language = 'This is an english text containing just one german word "Herren"
meaning gentleman. Lets see what happens.'
puts detector.probabilities(text: multi_language)
```
The full source of the example can be found here.
Spam filter

In this example we will write a simple spam filter integrated into a Rails app that lists user posts. It shows how you can reduce spam in user-created Rails models.

We begin by creating a simple Rails app.
```
$ rails new SpamFree
```
Then we scaffold the posts resource, which contains one field for the post's text and one for the content_type. The content_type of a post is one of the following strings:

'spam', 'content' or 'verified_content'

and will be set by our spam filter. As you would expect, posts with a content_type of 'spam' are spam and aren't accessible to users. The difference between 'verified_content' and 'content' is that verified_content has been checked by you (or a moderator) and is guaranteed to contain no spam. It is used to train the spam filter.

So go ahead and scaffold your Rails app:
```
$ rails g scaffold Post text:string content_type:string
$ rake db:migrate
```
We will need the NaiveText gem, so go ahead and put it in your Gemfile:
```
gem 'NaiveText'
```
Make sure to run bundler:
```
$ bundle
```
Also be sure to set the root route to 'posts#index'.

For easier querying we add three scopes to the Post model (learn more about scopes in the Rails guides).
```
class Post < ActiveRecord::Base
  scope :spam, ->{ where(content_type: 'spam') }
  scope :content, ->{ where.not(content_type: 'spam') }
  scope :verified, ->{ where(content_type: 'verified_content') }
end
```
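To see which posts each scope selects without a database, the same filters can be sketched on a plain array. PlainPost and the three helper methods are stand-ins invented for this illustration; the real scopes run as SQL queries. The sketch assumes every post has a content_type set.

```ruby
# Stand-in for the Post model (illustration only, no database involved).
PlainPost = Struct.new(:text, :content_type)

# Mirrors the spam scope: only posts marked as spam.
def spam_posts(posts)
  posts.select { |post| post.content_type == 'spam' }
end

# Mirrors the content scope: everything that is not spam,
# which includes verified posts.
def content_posts(posts)
  posts.reject { |post| post.content_type == 'spam' }
end

# Mirrors the verified scope: only posts checked by a moderator.
def verified_posts(posts)
  posts.select { |post| post.content_type == 'verified_content' }
end

posts = [
  PlainPost.new('Hello I am a friendly post', 'verified_content'),
  PlainPost.new('Just a regular post', 'content'),
  PlainPost.new('Arrgh this is a spam post', 'spam')
]
puts content_posts(posts).map(&:text)
```

Note that content covers both 'content' and 'verified_content', so verified posts stay visible to users while also serving as training material.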
The actual spam filter is pretty simple. It is a plain Ruby class that is never instantiated and exposes one public class method:
```
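class SpamFilter
  # Returns true if the classifier puts the post's text in the 'spam' category.
  def self.spam?(post)
    build_classifier
    @classifier.classify(post.text).name == 'spam'
  end

  # Rebuilds the classifier from the stored spam and verified posts.
  def self.build_classifier
    categories_config = [{ name: 'spam', examples: Post.spam, weight: 1 },
                         { name: 'content', examples: Post.verified, weight: 10 }]
    @classifier = NaiveText.build(categories: categories_config, default: 'content')
  end
  # A bare `private` does not apply to methods defined with `def self.`.
  private_class_method :build_classifier
end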
```
To integrate the spam filter into our Rails app, we need to make some adjustments to the PostsController. First, we want to list only posts that are not categorized as spam, so in the index action we replace Post.all with Post.content.
```
class PostsController < ApplicationController
  def index
    # Changed Post.all to Post.content
    @posts = Post.content
  end
  ...
```
The same goes for our set_post method.
```
    # Changed Post.find to Post.content.find
    def set_post
      @post = Post.content.find(params[:id])
    end
```
We also need to remove the content_type from our permitted parameters. This way no user can submit a post and send a content_type along.
```
    # Remove content_type
    def post_params
      params.require(:post).permit(:text)
    end
```
The real filtering happens in the controller's create action: we simply set the post's content_type based on the response of SpamFilter.spam?.
```
  def create
    @post = Post.new(post_params)
    if SpamFilter.spam?(@post)
      @post.content_type = 'spam'
    else
      @post.content_type = 'content'
    end
    respond_to
      ...
  end
```
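The branching in create can be tried out without a running Rails app by isolating it in a small helper. This is only a sketch: content_type_for, KeywordFilter and DraftPost are hypothetical names, and the keyword check merely stands in for the real SpamFilter.

```ruby
# Hypothetical helper: picks the content_type for a new post.
# `filter` is anything that responds to spam?(post), such as SpamFilter.
def content_type_for(post, filter)
  filter.spam?(post) ? 'spam' : 'content'
end

# Stand-in filter for illustration; it flags posts containing "buy now".
class KeywordFilter
  def self.spam?(post)
    post.text.downcase.include?('buy now')
  end
end

# Stand-in for an unsaved Post record.
DraftPost = Struct.new(:text, :content_type)

spammy = DraftPost.new('Buy now, limited offer!')
spammy.content_type = content_type_for(spammy, KeywordFilter)

friendly = DraftPost.new('Hello I am a friendly post')
friendly.content_type = content_type_for(friendly, KeywordFilter)
```

Passing the filter in as an argument keeps the decision easy to test, because a simple stand-in can replace the trained classifier.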
Before we can use our app, we need to make sure to put some example posts in our seed file
```
Post.create([{text: 'Hello I am a friendly post', content_type: 'verified_content'},
             {text: 'Arrgh this is a spam post', content_type: 'spam'}
             ])
```
and run
```
$ rake db:seed
```
Conclusion

This solution, as it stands, is just a first dip into the problem of filtering spam from user input. The current solution has some drawbacks:
- Performance: SpamFilter.spam? rebuilds the classifier on every call, so each new post queries all spam and verified posts.
- Verification: You need to manually verify content to get good results.
- Inflexibility: Right now a user can't mark posts as spam, nor can they do anything if their post is falsely classified as spam.
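One possible way to ease the performance drawback is to rebuild the classifier only when the training set has changed. The sketch below is an assumption, not part of NaiveText: CachedClassifier, fingerprint and builder are hypothetical names, and the cache is not thread-safe. In the Rails app, fingerprint could return the counts of Post.spam and Post.verified, and builder could wrap the NaiveText.build call.

```ruby
# Sketch: rebuild the classifier only when the training set has changed.
# `fingerprint` and `builder` are hypothetical callables supplied by the caller.
class CachedClassifier
  def initialize(fingerprint:, builder:)
    @fingerprint = fingerprint
    @builder = builder
    @cached_key = nil
    @classifier = nil
  end

  # Returns the cached classifier, rebuilding it if the fingerprint changed.
  def classifier
    key = @fingerprint.call
    if @classifier.nil? || key != @cached_key
      @classifier = @builder.call
      @cached_key = key
    end
    @classifier
  end
end

# Demo with a fake builder that only counts how often it runs.
builds = 0
examples = ['Arrgh this is a spam post']
cache = CachedClassifier.new(
  fingerprint: -> { examples.size },
  builder: -> { builds += 1; "classifier-#{builds}" }
)
cache.classifier          # first call builds the classifier
cache.classifier          # second call reuses the cached object
examples << 'more spam'
cache.classifier          # training set changed, so it is rebuilt
```

A count-based fingerprint misses edits that leave the number of posts unchanged, so a real app might prefer something like the latest updated_at timestamp.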
Make sure to check out the source code.
