I'd like to classify larger, article-sized bodies of text, and one approach I'm working on is text classification. Given 5 categories of diary entry (e.g. family, health, spiritual, work, recreational), would it be preferable to use a single model that labels text with one of the 5? Or should I follow the SentimentClassifier example and use 5 separate models that each classify a string in 3 ways (notFamily, neutralFamily, isFamily)? If the latter, is this a use case for components?

I would usually start with a single classifier model that labels text with one of your classes. That should be efficient and robust, and it should scale reasonably well with the number of classes. If you train multiple models instead, you need to run every one of them on each article you want to classify.

However, the question being asked is subtly different in the two cases. With a single model, you are asking which category or categories best match a given document. With separate models, you are asking whether a given document relates to a given category or not. The question you want to ask informs everything from your annotation, to the type of model you use, to the way you present your results.
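To make the difference concrete, here is a minimal sketch in plain Python. It does not use any particular library's API. The function names, the raw scores, and the 0.5 threshold are all made up for illustration, and in practice each separate model would produce its own score rather than sharing one list. The point is only that the single model always picks exactly one best category, while the separate models each answer yes or no independently, so a document can end up with several labels or none.

```python
import math

CATEGORIES = ["family", "health", "spiritual", "work", "recreational"]


def single_model_label(scores):
    """One multi-class model: softmax over all categories, return the best one."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best]


def separate_model_labels(scores, threshold=0.5):
    """One binary model per category: each score is judged on its own."""
    probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    return [c for c, p in zip(CATEGORIES, probs) if p >= threshold]


# Hypothetical raw scores for two diary entries (illustrative numbers only).
family_and_health_entry = [2.0, 1.5, -3.0, -1.0, -2.0]
off_topic_entry = [-1.0, -2.0, -3.0, -1.5, -2.5]

for scores in (family_and_health_entry, off_topic_entry):
    print(single_model_label(scores), separate_model_labels(scores))
```

With these numbers, the single model forces the off-topic entry into its least-bad category, while the separate models leave it unlabeled. Which behavior you want depends on the question you are asking of your documents.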

Ultimately you will have to decide what analysis is best suited to your particular application.
