I want to start learning ML and Core ML, and I have thought of a problem space that may be interesting and could use further exploration. I know NLP depends on extraordinarily large data sets, but I'm wondering about the utility of training a model on a constructed language with a much smaller data set. The one I have in mind has a very small official dictionary (slightly more than 100 official words) and rather simple grammar rules. Are there resources you would recommend for exploring this specific application of ML, or any pitfalls I might want to keep in mind?

This depends somewhat on what sort of tasks and models you are interested in.

For example, for a classification task, the maxent classifier available through Create ML is not language-dependent, so it should be able to take on classification tasks in an artificial language of this sort. Gazetteers are likewise language-independent, so they would still be usable; sketches of both follow below.
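For concreteness, here is a minimal sketch of training such a classifier with Create ML, selecting the maxent algorithm explicitly. The CSV path, column names, and split are placeholders, assuming you have a table of example sentences and labels:

```swift
import CreateML
import Foundation

// Hypothetical CSV with "text" and "label" columns: sentences in the
// constructed language paired with the category you want to predict.
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "training.csv"))
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)

// Ask for the maximum-entropy algorithm explicitly.
let parameters = MLTextClassifier.ModelParameters(algorithm: .maxEnt(revision: 1))
let classifier = try MLTextClassifier(trainingData: trainingData,
                                      textColumn: "text",
                                      labelColumn: "label",
                                      parameters: parameters)

// Check accuracy on the held-out rows, then export for use with Core ML.
let metrics = classifier.evaluation(on: testingData, textColumn: "text", labelColumn: "label")
print(metrics)
try classifier.write(to: URL(fileURLWithPath: "SentenceClassifier.mlmodel"))
```

A gazetteer can similarly be built directly from word lists with NLGazetteer; the labels and words here are invented for illustration:

```swift
import NaturalLanguage

// Hypothetical word lists keyed by the tag you want the tagger to emit.
let gazetteer = try NLGazetteer(dictionary: ["Animal": ["soweli", "waso"],
                                             "Food": ["kili", "pan"]],
                                language: nil)

let tagger = NLTagger(tagSchemes: [.nameType])
tagger.setGazetteers([gazetteer], for: .nameType)
```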

Our built-in embeddings are language-dependent, so they would not be of use here.

If you want to train your own embedding or language model using open-source tools, that would probably still require a significant amount of data, though perhaps not as much as with natural languages.
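If you do train small vectors with open-source tooling, Create ML can package them for on-device use. A minimal sketch, assuming you have already produced the vectors elsewhere (the words and numbers below are placeholders):

```swift
import CreateML
import Foundation

// Placeholder vectors; in practice these would come from an external
// trainer run over a corpus in the constructed language.
let vectors: [String: [Double]] = [
    "toki": [0.12, -0.40, 0.88],
    "pona": [0.31, 0.05, -0.47],
    "suli": [-0.66, 0.21, 0.10]
]

let embedding = try MLWordEmbedding(dictionary: vectors)
try embedding.write(to: URL(fileURLWithPath: "CustomEmbedding.mlmodel"))
```

Once compiled, the resulting model can be loaded at runtime with NLEmbedding(contentsOf:) and queried for neighbors and distances much like the built-in embeddings.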

Language modeling techniques have recently been applied with some success to programming languages. If your rules are similar to the syntax rules of programming languages, you might consider using the sorts of parsing tools that are used for them, but that is really a different area than NLP.
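With a vocabulary this small and rules this simple, even a hand-written recognizer may cover much of the grammar. A toy sketch of that approach, with an invented rule and invented word lists:

```swift
// Toy recognizer for an invented rule: sentence -> NOUN "li" VERB.
// The lexicon and the rule are made up purely for illustration.
let nouns: Set<String> = ["soweli", "jan"]
let verbs: Set<String> = ["moku", "lape"]

func isSentence(_ input: String) -> Bool {
    let tokens = input.split(separator: " ").map(String.init)
    guard tokens.count == 3 else { return false }
    return nouns.contains(tokens[0]) && tokens[1] == "li" && verbs.contains(tokens[2])
}

print(isSentence("soweli li moku"))  // true
print(isSentence("moku li jan"))     // false
```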
