Is it appropriate to use word embeddings to match long-form text to single-word categories? For example, computing the distance between `"exercise"` and `"Today I decided to ride my bike to the store. I needed to get a workout in."` I'd like to match sentences and paragraphs to tags.

The most robust approach to this sort of categorization would be to pick a set of categories in advance, collect training data, and train a classifier to classify sentences according to these categories.
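To make the classifier route concrete, here is a minimal sketch using a bag-of-words Naive Bayes model written in plain Python. Naive Bayes is just one possible choice, not the only way to do this, and the training sentences and category names are made up for illustration. In practice you would want far more data and probably an off-the-shelf library.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase, split on whitespace, and strip basic punctuation.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def train(examples):
    """examples: list of (text, category) pairs. Returns the counts used by predict()."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training texts
    vocab = set()
    for text, category in examples:
        doc_counts[category] += 1
        for w in tokenize(text):
            word_counts[category][w] += 1
            vocab.add(w)
    return word_counts, doc_counts, vocab

def predict(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[category][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical training data, invented for this example.
TRAINING = [
    ("I went for a run and then lifted weights", "exercise"),
    ("Rode my bike to get a workout in", "exercise"),
    ("Cooked pasta and baked bread for dinner", "food"),
    ("We ate lunch at a new restaurant", "food"),
]
model = train(TRAINING)
```

Calling `predict(model, "Today I decided to ride my bike to the store. I needed to get a workout in.")` picks whichever category has the highest smoothed log-probability for the words it recognizes.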

If you need to handle words outside the originally chosen set of categories, you could then use word embeddings to find an existing category similar to the entered word.
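A rough sketch of that fallback, assuming you already have a vector for each word: compare the new word to each known category by cosine similarity and take the closest one. The three-dimensional vectors below are invented toy values so the example runs on its own; real vectors would come from a pretrained embedding model.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "exercise": [0.9, 0.1, 0.0],
    "food":     [0.1, 0.9, 0.1],
    "travel":   [0.0, 0.2, 0.9],
    "workout":  [0.8, 0.2, 0.1],
    "dinner":   [0.2, 0.8, 0.0],
}
# The categories the classifier was trained on (hypothetical).
CATEGORIES = ["exercise", "food", "travel"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_category(word):
    """Return the known category whose vector is closest to `word`, or None."""
    if word not in EMBEDDINGS:
        return None  # no vector available for this word
    return max(CATEGORIES, key=lambda c: cosine(EMBEDDINGS[word], EMBEDDINGS[c]))
```

With these toy vectors, `nearest_category("workout")` maps to `"exercise"`, so a user-entered tag like "workout" could be routed to the classifier's existing "exercise" category.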

If you aren't able to train a model, things get a bit trickier. You can use tools such as part-of-speech tagging to identify relevant words in a sentence, e.g. nouns in the example you give, and determine how similar those are to the word you are trying to match. You would then need to figure out some way to take scores for individual words and form a score for an entire sentence.
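Here is a minimal sketch of the word-level scoring idea. To keep it dependency-free, a small stopword list stands in for real part-of-speech tagging (which you would get from a library such as NLTK or spaCy), and the embeddings are again invented toy values. It reports two simple ways to combine per-word similarities into a sentence score, the maximum and the mean; which one works better is something you would have to check on your own data.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "exercise": [0.9, 0.1, 0.0],
    "bike":     [0.7, 0.1, 0.4],
    "workout":  [0.8, 0.2, 0.1],
    "store":    [0.1, 0.5, 0.5],
    "ride":     [0.6, 0.0, 0.5],
}
# Crude stand-in for part-of-speech filtering: drop common function words.
STOPWORDS = {"i", "to", "the", "a", "in", "my", "today"}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_scores(text, tag):
    """Score `text` against `tag` by aggregating word-to-tag similarities."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    sims = [
        cosine(EMBEDDINGS[w], EMBEDDINGS[tag])
        for w in words
        if w in EMBEDDINGS and w not in STOPWORDS
    ]
    if not sims:
        return {"max": 0.0, "mean": 0.0}  # no usable words in the text
    return {"max": max(sims), "mean": sum(sims) / len(sims)}

text = "Today I decided to ride my bike to the store. I needed to get a workout in."
scores = sentence_scores(text, "exercise")
```

The maximum rewards a single strongly related word (here "workout"), while the mean is pulled down by unrelated words like "store", so the two can rank tags quite differently on longer paragraphs.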

Overall I think you would get better results by training a classifier, although it would require more work in advance for training.
