• Language detection

    Let's write a simple language detector. It will recognize German and English texts, and you can easily add more languages.

    We will need the NaiveText gem, so go ahead and install it:

    $ gem install NaiveText

    We will also need some example texts in both languages. Create a directory training with two subdirectories, training/german and training/english, and put some training material into them. Normally a few texts will do, but I decided to use a list of commonly used words for each language. You can find it together with the code here.
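
    The directory layout described above could be set up with a short script. This is only a sketch: the file name words.txt and the handful of sample words are placeholders chosen for illustration, not the word lists that accompany the original example.

    ```ruby
    require "fileutils"

    # Hypothetical sample words; real training material should be much larger.
    SAMPLE_WORDS = {
      "german"  => %w[der die das und ist nicht ich ein],
      "english" => %w[the and is not you that with for]
    }.freeze

    # Creates <root>/<language>/words.txt for every language and returns
    # the paths of the files that were written.
    def create_training_data(root = "training")
      SAMPLE_WORDS.map do |language, words|
        dir = File.join(root, language)
        FileUtils.mkdir_p(dir)
        path = File.join(dir, "words.txt") # file name is an assumption
        File.write(path, words.join("\n") + "\n")
        path
      end
    end
    ```

    Calling create_training_data without arguments writes training/german/words.txt and training/english/words.txt relative to the current directory.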

    require "NaiveText"
    
    class LanguageDetector
      def initialize
        german_examples = ExamplesFactory.from_files('training/german')
        english_examples = ExamplesFactory.from_files('training/english')
    
        categories_config = [{name: 'german', examples: german_examples },
                             {name: 'english', examples: english_examples},
                            ]
    
        # Stored in an instance variable so other methods can use it.
        @classifier = NaiveText.build(categories: categories_config)
      end
    end
    

    The above code does two things:

    1. It loads our examples (saved previously).
    2. It builds a classifier.

    To build a simple classifier you only need to specify an array of categories, each consisting of a name and some example texts that give the classifier something to learn from.
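
    To give a feeling for what such a classifier does with its examples, here is a minimal word-count naive Bayes sketch in plain Ruby. It is not NaiveText's implementation: the class name TinyClassifier, the add-one smoothing, and the tiny inline examples are all assumptions made for illustration.

    ```ruby
    # A toy naive Bayes text classifier, for illustration only.
    class TinyClassifier
      # categories: array of hashes like { name: "german", examples: ["text", ...] }
      def initialize(categories)
        @counts = categories.to_h do |category|
          words = category[:examples].flat_map { |text| tokenize(text) }
          [category[:name], words.tally]
        end
        @vocabulary_size = @counts.values.flat_map(&:keys).uniq.size
      end

      # Returns the name of the category with the highest score.
      def classify(text)
        scores(text).max_by { |_name, score| score }.first
      end

      # Log-likelihood of the text for each category, with add-one smoothing
      # so that unseen words do not produce a probability of zero.
      def scores(text)
        @counts.transform_values do |word_counts|
          total = word_counts.values.sum + @vocabulary_size
          tokenize(text).sum { |word| Math.log((word_counts.fetch(word, 0) + 1.0) / total) }
        end
      end

      private

      # Lowercases the text and splits it into runs of letters.
      def tokenize(text)
        text.downcase.scan(/\p{L}+/)
      end
    end

    classifier = TinyClassifier.new([
      { name: "german",  examples: ["der die das und ist nicht ein text"] },
      { name: "english", examples: ["the and is not a this text obviously"] }
    ])
    puts classifier.classify("This is obviously an english text") # prints "english"
    puts classifier.classify("Der Text ist deutsch")              # prints "german"
    ```

    The more example words a category shares with the input, the higher its score, which is why even a plain list of common words works as training material.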

    Next we will write two methods to interface with the classifier:

    require "NaiveText"
    
    class LanguageDetector
      def initialize
          ... #like above
      end
    
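      # Returns the name of the detected language as a string.
      def get_language(args)
        @classifier.classify(args[:text]).name
      end
    
      # Returns the probabilities for each category.
      def probabilities(args)
        @classifier.probabilities(args[:text])
      end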
    end
    

    The first method, get_language, takes a text as a keyword argument and returns the corresponding language as specified in our categories array. Note that classifier.classify returns a category, so you need to call name on it to get the language name as a string.

    The second method, probabilities, also takes a text as a keyword argument, but it returns a collection of probabilities. You can print it using puts (example below).

    Now we can use our language detector:

    ... # Code from above
    
    detector = LanguageDetector.new
    
    puts detector.get_language(text: 'This is obviously an english text') # ==> english
    puts detector.get_language(text: 'Der Text ist deutsch')              # ==> german
    
    multi_language = 'This is an english text containing just one german word "Herren"
    meaning gentleman. Lets see what happens.'
    puts detector.probabilities(text: multi_language)
    

    The full source of the example can be found here.

  • Spam filter

    In this example we will write a simple spam filter integrated into a Rails app that lists user posts. It shows how you can reduce spam in user-created Rails models.

    We begin by creating a simple Rails app.

    $ rails new SpamFree
    

    Next, scaffold the posts resource with one field for the post's text and one for its content_type. The content_type of a post is one of the following strings:

    'spam', 'content' or 'verified_content'

    and will be set by our spam filter. As you might expect, posts with a content_type of 'spam' are spam and aren't shown to users. The difference between 'verified_content' and 'content' is that verified_content has been checked by you (or a moderator) and is guaranteed to contain no spam. It is used to train the spam filter.

    So go ahead and scaffold your Rails app:

    $ rails g scaffold Post text:string content_type:string
    $ rake db:migrate
    

    We will need the NaiveText gem, so go ahead and put it in your Gemfile:

    gem 'NaiveText'
    

    Make sure to run bundler:

    $ bundle
    

    Also be sure to set the root route to 'posts#index'.

    For easier querying we add three scopes to the Post model (learn more about scopes in the Rails guides).

    class Post < ActiveRecord::Base
      scope :spam, ->{ where(content_type: 'spam') }
      scope :content, ->{ where.not(content_type: 'spam') }
      scope :verified, ->{ where(content_type: 'verified_content') }
    end
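
    Note that the content scope also returns verified posts, because it only excludes 'spam'. The following plain-Ruby sketch mimics the three filters on an in-memory array to make that overlap visible. The PostRow struct, the partition_posts helper, and the sample posts are made up for illustration; no database or ActiveRecord is involved.

    ```ruby
    # Stand-in for the Post model; illustration only, not ActiveRecord.
    PostRow = Struct.new(:text, :content_type)

    # Mirrors the three scopes with plain Array filters.
    def partition_posts(posts)
      {
        spam:     posts.select { |post| post.content_type == "spam" },
        content:  posts.reject { |post| post.content_type == "spam" },
        verified: posts.select { |post| post.content_type == "verified_content" }
      }
    end

    posts = [
      PostRow.new("Buy cheap pills", "spam"),
      PostRow.new("Nice weather today", "content"),
      PostRow.new("Hello I am a friendly post", "verified_content")
    ]

    result = partition_posts(posts)
    puts result[:spam].size     # 1
    puts result[:content].size  # 2 (includes the verified post)
    puts result[:verified].size # 1
    ```

    Unlike this array version, the SQL condition generated by where.not also leaves out rows whose content_type is NULL.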
    

    The actual spam filter is pretty simple. (It is just a plain Ruby class with one public class method.)

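    class SpamFilter
      # Returns true if the classifier puts the post into the 'spam' category.
      def self.spam?(post)
        build_classifier
        @classifier.classify(post.text).name == 'spam'
      end
    
      # Rebuilds the classifier from the current spam and verified posts.
      def self.build_classifier
        categories_config = [{name: 'spam', examples: Post.spam, weight: 1},
                             {name: 'content', examples: Post.verified, weight: 10}]
        @classifier = NaiveText.build(categories: categories_config, default: 'content')
      end
      # A bare `private` does not apply to methods defined with `def self.`.
      private_class_method :build_classifier
    end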
    

    To integrate the spam filter into our Rails app, we need to make some adjustments to the PostsController. First, we want to list only posts that are not categorized as spam, so in the index action we replace Post.all with Post.content.

    class PostsController < ApplicationController
      def index
        # Changed Post.all to Post.content
        @posts = Post.content
      end
      ...
    

    The same goes for our set_post method.

        # Changed Post.find to Post.content.find
        def set_post
          @post = Post.content.find(params[:id])
        end
    

    We also need to remove the content_type from our permitted parameters. This way no user can submit a post and send a content_type along.

        # Remove content_type
        def post_params
          params.require(:post).permit(:text)
        end
    

    The real filtering happens in the controller's create action: we simply set the post's content_type based on the response of SpamFilter.spam?.

      def create
        @post = Post.new(post_params)
        if SpamFilter.spam?(@post)
          @post.content_type = 'spam'
        else
          @post.content_type = 'content'
        end
        respond_to
          ...
      end
    

    Before we can use our app, we need to put some example posts in our seed file:

    Post.create([{text: 'Hello I am a friendly post', content_type: 'verified_content'},
                 {text: 'Arrgh this is a spam post', content_type: 'spam'}
                 ])
    

    and run

    $ rake db:seed
    

    Conclusion

    This solution, as it stands, is just a first dip into the problem of filtering spam from user input. It has some drawbacks:

    • Performance: SpamFilter.spam? rebuilds the classifier on every call, so every newly created post triggers a query for all spam and verified posts.
    • Verification: You need to verify content manually to get good results.
    • Inflexibility: Right now users can't mark posts as spam, nor can they do anything if their post is falsely classified as spam.
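
    The performance drawback could be eased by building the classifier once and reusing it until the training data changes. The sketch below shows only the caching pattern in plain Ruby: CachedFilter, reset!, and the injected builder are hypothetical names, and the lambda stands in for the NaiveText.build call from the spam filter.

    ```ruby
    # Caches an expensive-to-build object and rebuilds it only on demand.
    class CachedFilter
      # builder: any callable that returns a freshly built classifier
      def initialize(builder)
        @builder = builder
        @classifier = nil
      end

      # Builds the classifier on first use, then reuses it.
      def classifier
        @classifier ||= @builder.call
      end

      # Call this after posts are verified or marked as spam, so that the
      # next lookup retrains on the new data.
      def reset!
        @classifier = nil
      end
    end

    builds = 0
    filter = CachedFilter.new(-> { builds += 1; Object.new }) # stand-in builder

    3.times { filter.classifier }
    puts builds # 1: built once, then reused twice
    filter.reset!
    filter.classifier
    puts builds # 2: rebuilt after the reset
    ```

    In a multi-process Rails deployment each process would hold its own cached copy, so this only reduces the work and does not keep the processes in sync.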

    Make sure to check out the source code.
