WWDC22 - Machine Learning Digital Lounge

Questions and answers collected from the WWDC22 Machine Learning Digital Lounge, which was held from 07 - 10 June 2022.


When training an Object Detection Model in Create ML, is there a way to take automatic snapshots to ensure that there is a record of how the model is performing throughout iterations as opposed to having to take manual snapshots to do so?

In the Create ML app this is a manual process. You can also set options on the project to automatically snapshot on pause or when you tap Train More. Having an automatic option is a great feature request. Please consider filing a request using Feedback Assistant.

You do have the option to use the Create ML framework directly in Swift and set your own cadence of checkpointing.
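
For reference, here is a rough sketch of what checkpointed training looks like with the Create ML framework. It assumes you have already built an MLObjectDetector.DataSource as described in the documentation (dataSource below stands in for it), and the paths and intervals are illustrative only:

import CreateML
import Foundation

// Where checkpoints and progress reports are written.
let sessionDirectory = URL(fileURLWithPath: "/path/to/training-session")

let sessionParameters = MLTrainingSessionParameters(
    sessionDirectory: sessionDirectory,
    reportInterval: 10,       // report progress every 10 iterations
    checkpointInterval: 100,  // snapshot every 100 iterations
    iterations: 1_000
)

// `dataSource` is an MLObjectDetector.DataSource you have prepared beforehand.
let job = try MLObjectDetector.train(
    trainingData: dataSource,
    sessionParameters: sessionParameters
)

Because the session directory persists between runs, you can resume training from the latest checkpoint by calling train(...) again with the same session parameters.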

MLObjectDetector

Question page
Can I train a tabular classifier model on iOS?

You absolutely can! Have a look at last year's session on building dynamic apps:

Build dynamic iOS apps with the Create ML framework (WWDC 2021)

Also, be sure to check out the Get to know Create ML Components session that dropped today. There, Alejandro walks through building a tabular regressor all in Swift.

Question page
When training a video action classifier model in Create ML, is it best to have only one person's poses in the frame (and crop out others)?

Yes.

If you have multiple people in the frame, try to keep the other people consistently smaller than the main person. It will then still work (the model automatically selects the person with the largest bounding box).

Check out the video from WWDC 2020:

Build an Action Classifier with Create ML (at 24m21s)

When it comes to using the model in your applications, make sure to only select a single person. Your app can remind users to keep only one person in view when multiple people are detected, or you can implement your own selection logic to choose a person based on their size or location within the frame, using the coordinates from the pose landmarks.

Question page
This may be a more general Machine Learning question. I am interested in the ability to extract text from images and video. The Vision framework does a great job of extracting the text. One thing I would like to do is determine whether each piece of text is in a particular category of typeface, largely looking to tell a source code / monospace font from a sans serif font. Which machine learning technologies available on Apple platforms would be best suited for that? And a high level of how you might approach that?

So the Vision framework, where you extract the text, tells you the region in the image where the text is; the first thing to do would be to crop the text out of the image.

If you have a binary image classifier (sans serif vs serif, or “looks like source code” vs “doesn’t look like source code”, it’s worth experimenting with what definition works best – and you’d need to collect samples of each for your training set!), you can then throw that crop to this classifier to work out whether it’s source code or not.

So at a high level, what I’d do is:

  • train a binary classifier to distinguish source-code from not-source-code
  • using Vision, crop out the region of the image with detected text in
  • use your classifier to determine whether it’s source code or not

and go from there!
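
As an illustration of steps 2 and 3, here is a minimal sketch. FontStyleClassifier is a placeholder name for whatever binary Core ML classifier you train, and the y-flip is needed because Vision reports boxes with a lower-left origin while CGImage cropping uses an upper-left origin:

import Vision
import CoreGraphics

func classifyTextRegions(in cgImage: CGImage) throws {
    // Find the text and its bounding boxes.
    let textRequest = VNRecognizeTextRequest()
    try VNImageRequestHandler(cgImage: cgImage).perform([textRequest])

    // Your trained binary classifier, wrapped for Vision.
    let classifier = try VNCoreMLModel(for: FontStyleClassifier(configuration: .init()).model)

    for observation in textRequest.results ?? [] {
        // Convert the normalized box to pixel coordinates and flip the y axis.
        var box = VNImageRectForNormalizedRect(observation.boundingBox,
                                               cgImage.width, cgImage.height)
        box.origin.y = CGFloat(cgImage.height) - box.maxY
        guard let crop = cgImage.cropping(to: box) else { continue }

        // Classify the cropped text region.
        let classifyRequest = VNCoreMLRequest(model: classifier)
        try VNImageRequestHandler(cgImage: crop).perform([classifyRequest])
        if let top = classifyRequest.results?.first as? VNClassificationObservation {
            print("\(observation.topCandidates(1).first?.string ?? ""): \(top.identifier)")
        }
    }
}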

Also, you can try out the Live Text API this year; it's able to extract text out of images and videos. However, it does not provide font-related information about the text yet. You can file a feedback report for this if needed.


From a non-Apple developer:

I created a serif/sans-serif model with CreateML. You can find it here: https://github.com/jmousseau/Mimeo

Question page
In the "What's new in Create ML" talk, near the end Repetition Counting was mentioned, and a reference to the "linked article and sample code", yet the WWDC22 Sample Code list does not include this, nor does the documentation, I believe. Can you point me to the Sample Repetition Count code and Documentation?
When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?

The window size is fixed once a model is trained. At prediction time, you may adjust something called the stride, i.e., how often (in terms of number of frames) you create a new prediction window. The smaller the stride, the sooner you make the next prediction, which makes the results refresh faster.

There is sample code from last year, Detecting Human Actions in a Live Video Feed, that you can check as a reference.

Your app - and the model - need to balance three concerns: Accuracy, Latency, and CPU Usage.

  • A shorter window size (selected at training time) may reduce run-time latency! But if the window is shorter than the longest event you are trying to recognise, then accuracy may be reduced
  • A shorter stride can also reduce latency, but results in more frequent calls to the model, which can increase energy use.
  • A longer stride reduces CPU usage and gives other functions in your app, such as drawing, more time to work - which might help reduce any performance-related bugs in your app, but will increase latency.

Also, try not to skip frames (unless you did the same for your training data); otherwise, the effective action speed captured in the window will change, and this may hurt the prediction accuracy.

Latency: the time between when the action happens in the real world and when the app detects it and tells the user. This will be experienced as 'snappiness'. For action classification, latency is at least as long as the window size, plus a bit. It's an emergent parameter caused by other choices you make.

Stride: the time the model waits between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.

Latency can be slightly shorter than the window size, because the model can predict the action when it sees a partial action. That's why a shorter stride can help refresh the results faster.
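
As a rough sketch of how window size and stride interact in an inference pipeline (the 60-frame window, 15-frame stride, and the omitted pose-packing step are placeholders, loosely following the sample code above):

import Vision

final class ActionPredictor {
    private let windowSize = 60   // fixed when the model was trained
    private let stride = 15       // tune at inference time: smaller = faster refresh
    private var poseWindow: [VNHumanBodyPoseObservation] = []
    private var framesSinceLastPrediction = 0

    func add(_ pose: VNHumanBodyPoseObservation) {
        poseWindow.append(pose)
        if poseWindow.count > windowSize { poseWindow.removeFirst() }
        framesSinceLastPrediction += 1

        // Only run the classifier every `stride` frames once the window is full.
        if poseWindow.count == windowSize, framesSinceLastPrediction >= stride {
            framesSinceLastPrediction = 0
            predict()
        }
    }

    private func predict() {
        // Pack the window of poses into the classifier's expected MLMultiArray
        // and call prediction(...) here; omitted for brevity.
    }
}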

You may also be able to reduce energy, memory and CPU usage by setting the camera to a lower resolution. If your app is very busy, making this change might improve app performance. This change should not reduce model accuracy, but your user's view of the scene will not be so sharp.

Another inference speed factor might be your device. A newer device with neural engine will help. A fully charged device, and operating in good lighting condition, etc. can also help to reduce the latency (improve the Vision pose detection speed, and camera frame rate)

We have a great session on understanding the performance of apps with machine learning models this year! Check it out:

Optimize your Core ML usage

Question page
How would you recommend to approach the classification of various fine-grained subclasses of the same class? Specifically talking about different types of something made of paper. For example: "a postcard with something written on it", vs "an empty postcard" vs "just some piece of paper" vs "another object"? With a classifier model we were able to obtain very accurate results to distinguish "paper vs some other object". However we couldn't get accurate enough results (I think ~60% accuracy) regarding the more fine-grained decisions: "postcard vs some piece of paper" and "postcard with text vs empty postcard". The mistakes were usually into false-positive side (identifying some piece of paper as a postcard in my example). So how would you setup the training samples for this sort of goal? Or are we looking in the wrong option, and should be considering some other method, or a combination of methods instead?

Something you could try is doing a hierarchical approach where you first detect the overall class and then crop and do a more precise classification.

In addition to tuning the training, you can also tune the data you're training on, like you mention. Adding more examples (with enough breadth to cover what you expect to see in practice) where you're getting the false positives might help iron out those edge cases. There are some data augmentation options in both the Create ML app and Create ML Components that could help grow your sample size and add some diversity to it without collecting monumentally more data.

Hard to say if it's feasible. But getting text definitely sounds better suited if your sub-classes are always text based.

Question page
Does an MLModel need to be adapted in any way to support predicting into buffers given with MLPredictionOptions.outputBackings?

MLModel does not need to be adapted in any way to accept output backings. MLPredictionOptions.outputBackings can be used to provide either a CVPixelBuffer or MLMultiArray depending on the output feature value type. When a CVPixelBuffer is provided as output backing for prediction, please ensure that the base address is not locked for read / write. Please check out Optimize your Core ML usage session tomorrow.
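
A minimal sketch of the output backing usage, assuming a model with a single image output (the feature name "outputImage" and the 512x512 size are placeholders for your model's actual output):

import CoreML
import CoreVideo

func predictIntoBacking(model: MLModel, input: MLFeatureProvider) throws -> MLFeatureProvider {
    // Create an IOSurface-backed pixel buffer for the model to write into.
    var pixelBuffer: CVPixelBuffer?
    let attributes: [CFString: Any] = [kCVPixelBufferIOSurfacePropertiesKey: [:] as [String: Any]]
    CVPixelBufferCreate(kCFAllocatorDefault, 512, 512,
                        kCVPixelFormatType_OneComponent16Half,
                        attributes as CFDictionary, &pixelBuffer)

    let options = MLPredictionOptions()
    options.outputBackings = ["outputImage": pixelBuffer!]

    // Don't lock the buffer's base address around this call.
    return try model.prediction(from: input, options: options)
}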

Question page
I‘m trying to get into Machine learning. What’s the best way to get to know all the methods for integrating CreateML into an app?

That's exciting, glad you're interested! I can definitely recommend this year's (and previous years') sessions on Create ML. Each one covers part of Create ML and might be useful to you.

Here are some videos that can help you start from scratch. As others suggested there are a lot more things added to Create ML in the last couple years.

Choosing the right task and understanding the data needs for training are a great place to start.

Question page
I would like to build an app to predict stock price trends and classify expenses in a bookkeeping / tax management app. How might I get started with Create ML?

So for something like that, you might look into Tabular Regressors for predicting future prices based on past price data (and whatever other data you want to use), and Tabular Classifiers to classify expenses into categories you select.

You might be particularly interested in training on-device using the Create ML framework for this... given the historical data would be highly personal and continually changing (if I'm understanding the use case). The Tabular Regressor example that Alejandro shows in the Get to know Create ML Components session is almost spot on with the problem you're trying to solve.

If the textual input is more varied and similar to human language then you might use a Text Classifier. If it's more like terms and you want to include additional context such as price (numbers), then the Tabular Regressor is more suitable.

Question page
Create ML Components looks amazing!! Have you experimented recreating popular architectures, like GANs and Autoencoders, with Create ML components?

I'm glad you like it. Components allow you to work with a lot of different architectures.

Create ML Components does not support building or training neural networks. It's closer to something like scikit-learn

Question page
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
What are some ways to apply Create ML and Core ML to everyday tasks? What are the best tasks for Core ML models?

Really, the limit’s your imagination! But machine learning methods, generally, work well if you have:

  • a well-defined objective (in other words, a clear, unambiguous criterion which tells you whether a classification is the right one or how far an estimate is from its true value)
  • enough training data.

That’s true of a lot of different problems! Image, audio and text classification are all things which are applicable to a lot of real-world problems.

The key way to take advantage of ML is identifying problems where the goal can be clearly defined but the method is tricky. It’s not worth training a ML model to say whether something is red or blue – the average pixel value tells you that. But determining the class of an object (or reading text out of an image, or…) – these are problems where you can define what a success is unambiguously, but coming up with heuristics for the problem is harder. That’s where you’ll get most “bang for the buck” from ML methods!

Question page
Can a GAN solution be made from ML Components?

If you already have a GAN model, then you can use it in Create ML Components using an adaptor.

Create ML Components does not offer out-of-box support for any specific GAN model, but you are able to make a transformer to use a GAN under the hood.

If you don't already have a model, I'd suggest starting by checking the models you are interested in (where they are published) and seeing whether they can be converted to Core ML models, if you want to perform inference (deploy) on Apple hardware. Create ML and Create ML Components don't offer out-of-the-box support for such models; they are meant to give you the flexibility to do so.

Question page
There was a great WWDC video back in 2018 titled Vision with Core ML. The example app used live video capture to feed an image through the model using scene stability. There is sample code out there for UIKit, but always wanted to try and re-make using SwiftUI as a starting point. Any tips or pointers on making an image recognition app with live video capture using SwiftUI as a starting point ?

You have a couple choices here. You can use the techniques you already know, and use UIViewRepresentable to import the UIViews into your app. That will still work really well!

I've actually done exactly that using UIViewControllerRepresentable in the app I'm working on, and used a modified version of the VisionViewController which I think came from that video's sample code. It works perfectly... the entire app is SwiftUI except for the VisionViewController part.

Here's a link to the original sample code that has the VisionViewController (click the download button from that page): Training a Create ML Model to Classify Flowers. Unfortunately I don't have a publicly available version that shows where I've used that with UIViewControllerRepresentable.

Alternatively, you can display the camera feed by sending the image from each frame into a SwiftUI Image. The new Create ML Components feature actually includes a VideoReader.readCamera method which provides an asynchronous stream of image buffers, which is a great way to get started with this approach. Alternatively you can use your existing AVCaptureDevice logic and delegate methods to provide a series of images. You can see an example of this approach in the rep counting demo app which will be available soon as part of this year's WWDC session What's new in Create ML

Recently I refactored some UIKit code into SwiftUI and found that the ViewController could be relatively easily transformed into an ObservableObject class, changing @IBOutlets into @Published properties.

I've been experimenting with using AVCaptureVideoPreviewLayer too. Using a preview layer is likely to be a very CPU / memory efficient way to do it. I don't yet have a clear picture of the efficiency of using SwiftUI views to run the preview directly in the way I proposed, so while it can be a lot of fun to do it purely in SwiftUI it may not be the best choice just yet.
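
For reference, here is a minimal sketch of the UIViewControllerRepresentable approach mentioned above, where VisionViewController stands in for the controller from the sample code:

import SwiftUI
import UIKit

struct VisionView: UIViewControllerRepresentable {
    func makeUIViewController(context: Context) -> VisionViewController {
        VisionViewController()
    }

    func updateUIViewController(_ controller: VisionViewController, context: Context) {
        // Push SwiftUI state changes into the controller here if needed.
    }
}

struct ContentView: View {
    var body: some View {
        VisionView()
            .ignoresSafeArea()
    }
}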

Question page
When will Create ML support Neural Networks?

Just to clarify, it does support neural networks. For instance FullyConnectedNetworkClassifier.

But if you wanted to create a custom network you would need to use Metal or Accelerate.

Question page
We can separate people and background on photo, for example to create stickers, using VNGeneratePersonSegmentationRequest. What about animals, objects and etc like you did it in iOS 16? I mean feature that I can long-press at any object on photo to copy/paste it. Do we have ready API for that?

Yes, you can use VNGeneratePersonSegmentationRequest for people. There currently is no equivalent for animals or other objects via the Vision APIs.

Please consider filing a feature request or feedback via https://feedbackassistant.apple.com or bringing more questions to the Q&A Vision digital lounge from 2-3pm on Thursday
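
For the person case, a minimal sketch of the Vision request (the quality level and output format here are assumptions you would tune for your use case):

import Vision
import CoreVideo
import CoreGraphics

func personMask(for cgImage: CGImage) throws -> CVPixelBuffer? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8  // 8-bit mask

    try VNImageRequestHandler(cgImage: cgImage).perform([request])
    return request.results?.first?.pixelBuffer
}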

Question page
I've created a small image classification model using Create ML with around 350 image labels. However, for the iOS app I'm making that could scale to over 100,000 labels (and likely much more) - each with over a hundred images for training/testing. Is there any way to scale to that level using Create ML? I've started teaching myself TensorFlow and researching various cloud services like Google Colab for the training because I think I'll need to go that route... and then convert that to Core ML. I'd appreciate any thoughts / recommendations.

Wow, 100,000 classes! That’s a hard problem – in fact, it’s something which would be at the research cutting edge. Is there any structure to your classes? That might be something you could exploit.

It’s definitely worth having a look at the literature here — these tend to be called “extreme classification tasks”

For instance http://manikvarma.org/pubs/bengio19.pdf

This is a review which covers some of the problems in this area. However, if you have a natural hierarchy to your labels, you might consider having a hierarchy of classifiers. Let’s say we’re talking about animals; you might have a classifier from “animal” to “bird”, “mammal”, “reptile”, etc etc etc, then from “bird” to bird species

That way each classifier is only predicting among, say, 1,000 classes – which is a more tractable problem

If there is no class hierarchy, and you're looking to recognize individual identifiable objects, you may want to check out algorithms behind similar image search, where you calculate an embedding for an image, then find N nearest embedding from the known images, and derive your class/object from them.

Question page
Is there the ability in Create ML to make text generation models, GPT-3 type applications? Haven’t seen it but wanted to double check

No. Create ML and Create ML Components are meant to let you create custom ML models fitted to your training data. If you want to use such a model, it makes sense to convert it to Core ML and try it out that way.

Question page
I'm new to ML. I would like to implement some sort of color matching with two photos (i.e. when superimposing a person on a different background, adjusting color, contrast, etc. to match the background better). Is something like that suited for Core ML (and if so, do you have any suggestions on how to approach that?), or would a simple algorithm be a better solution for those kinds of tasks?

Doesn't sound like something you can do with Core ML or Create ML. Try the Vision API Q/A on Thursday from 2-3 PM PT.

Question page
Is there a list of transformers and estimators available for use?

We have many transformers and estimators. A good place to get the list is the developer documentation.

https://developer.apple.com/documentation/createmlcomponents

Question page
Can you use any iOS graphics API for data augmentation?

Yes, you are free to use any API. The only requirement is that it produces a CIImage.
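
For example, here is a small sketch of an augmentation written with Core Image (the rotation and exposure ranges are arbitrary); any other graphics API works too, as long as you end up with a CIImage:

import CoreImage
import CoreImage.CIFilterBuiltins

func augment(_ image: CIImage) -> CIImage {
    // Random small rotation.
    let rotated = image.transformed(
        by: CGAffineTransform(rotationAngle: .random(in: -0.2...0.2)))

    // Random exposure tweak.
    let exposure = CIFilter.exposureAdjust()
    exposure.inputImage = rotated
    exposure.ev = Float.random(in: -0.5...0.5)
    return exposure.outputImage ?? rotated
}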

Question page
Can someone clarify the difference between a classifier and regressor again for me?

A classifier is used to predict a discrete categorical value. Think of it as predicting an enum.

A regressor is used to predict a real value. Think of it as predicting a float or double.

In the first demo in Get to know Create ML Components a regressor was used to predict a ripeness value.

A classification approach to the same problem would require you to define categories like “green”, “not ripe”, “ripe”, “over ripe”. This is an option as well, but you would not be able to compare the ripeness of two examples that got classified into the same category.
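
As a small illustration with the Create ML framework (column names like "ripeness" and "category" are made up):

import CreateML
import Foundation

let table = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/fruit.csv"))

// Regressor: predicts a continuous value, e.g. ripeness from 0.0 to 1.0.
let regressor = try MLLinearRegressor(trainingData: table, targetColumn: "ripeness")

// Classifier: predicts one of a fixed set of labels, e.g. "ripe" / "not ripe".
let classifier = try MLClassifier(trainingData: table, targetColumn: "category")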

How about run time efficiency between the above mentioned classifier and regressor?

They should have similar prediction compute time if you are using a common feature extractor. For image models, that feature extraction step will likely dominate inference compute. LogisticRegressionClassifier and LinearRegressor will both be doing a similar-size matrix multiplication behind the scenes, particularly if you restrict yourself to a few classes. As the number of classes increases, the classifier will become slower.

Question page
Can you point me to information on the TabularData framework?
How can I fix a “can’t find MLLinearRegressor in scope” error?

MLLinearRegressor is a symbol from CreateML, do:

import CreateML

instead of:

import CoreML

The quickest way is to try that in an Xcode Playground

Note: CreateML is NOT available on the iOS simulator, so when you build an app for iOS, target a physical iOS device if you want to build & run. If you just want to build without needing a physical iOS device, choose "Any iOS Device" as the build destination.

If you want to try with a Playground, please use macOS Playground, since iOS Playground uses iOS simulator as well.

Question page
Would Create ML/Components/Core ML be capable of dealing with music data? Specifically I want to train a model that can predict the tempo (BPM - beats per minute) of a song.

Core ML models can support audio data through MultiArray inputs of audio samples. Create ML does support audio/sound classification but not tempo estimation.

Doing a quick search online it seems there is some work on models for tempo estimation, many of which could be converted to Core ML

Question page
How can I ball park the runtime for linear regressor predictions? What I really want to know, how often I can call this?

I love this question, and we have a great session for you which dropped today.

Optimize your Core ML usage

Question page
Why do older devices use more memory when loading an MLModel? A model used about 600MB of memory on the iPhone 12 Pro, but on the iPhone 11 Pro the app crashed after using over 1.2GB of memory.

We need to look at the model in question. Please submit a problem report through Feedback Assistant.

We are not aware of a general memory usage issue of this kind across different hardware.

Question page
What if my data is exponential, would that need a quadratic regressor? In one of the videos the data was parabolic and you normalized it. What’s going on here?

So this is a technique from classical statistics, but basically a common approach is to take a raw set of data and normalize it before applying a fit function. You could run data with an exponential distribution through a log transform, for example. One of the risks if you don't is that the leverage from a few points can be very high and skew your model. There's tradeoffs to this approach and many different techniques that can help you find a good model, but data normalization is often a helpful technique.

I think in general it can be helpful in data preparation as one tool that's available to use. But there's an entire field on this and I don't want to oversimplify.
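
A tiny sketch of the log-transform idea in plain Swift: normalize the exponentially distributed target before fitting, then invert the transform on the model's predictions (the numbers are made up):

import Foundation

let rawTargets: [Double] = [1.2, 3.5, 11.0, 95.0, 820.0]
let normalizedTargets = rawTargets.map { log($0) }  // fit the regressor on these

// ...train on normalizedTargets...

let predictedLog = 4.1             // example model output
let predicted = exp(predictedLog)  // back on the original scale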

Question page
I've noticed you have a new way to store model weights in sparse form (which is a great addition) and am wondering it there's some fundamental blocker to using sparse operations at inference time too?

The sparsity is leveraged during execution on certain compute units such as the Neural Engine.

This is not generally the case for any model which contains sparse matrices. It is best to use the sparse encoding explicitly.

Question page
We're facing some challenges on converting our AI models to Core ML, some operations 'state-of-the-art' aren't fully supported and we're considering running it in a different approach. Would it be feasible to have a model in C++ and leverage the GPU power of the devices, if yes... how? Is there any workaround for torch.stft and torch.istft?

A composite operator may help you convert these operations:

https://coremltools.readme.io/docs/composite-operators

In some cases you can also supply a custom operator, but to leverage the full Core ML stack it's best to see if you can represent the functionality as a composite op first.

Are the FFT operations in the beginning / pre processing stage of the model? If so, then you can accelerate the rest of the model by converting it to Core ML, and implementing the FFT operation using BNNS or Metal.

In any case, submitting a feedback request with your use case would be great.

Question page
Are there any updates planned for coremltools, notably supporting complex numbers. I'm looking to convert a TensorFlow model, but even after implementing missing operations like Fast Fourier Transform, I'm blocked by the fact there is no complex number support.

Yes, you are right. There isn’t a complex number data type in the CoreML MIL spec.

The best way to use them is by treating complex numbers as 2D vectors of real and imaginary numbers.

Question page
Hi, can an ML model extract certain values of a json such as “VideoType” and URLs then return those values to make a network request? I’m looking a making a video recommendation system with ML but not sure the best way to do it.

Have you tried the MLRecommender in Create ML? That is a great place to start for building recommender systems. Since your data is in json format, you can also try the TabularData framework which can help with the json loading!
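
A minimal sketch of the TabularData route (the file path and the "VideoType" and "url" column names are assumptions about your JSON):

import TabularData
import Foundation

let fileURL = URL(fileURLWithPath: "/path/to/videos.json")
let frame = try DataFrame(contentsOfJSONFile: fileURL)

// Pull out the columns you need before making your network requests.
let videoTypes = frame["VideoType", String.self]
let urls = frame["url", String.self]
print(videoTypes)
print(urls)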

Question page
Can the number of features vary? For example, say I want to predict a cat's age given the number of whiskers, fur density, and number of legs. However, sometimes I may only have number of whiskers and number of legs, but not fur density. Would that require its own, separately trained, MLModel?

This is such a cool example! Thanks for the question. Have you tried using the tabular classifiers in Create ML? When you have missing data in your feature columns you can try replacing them (imputing). The TabularData framework makes this part really easy.

Question page
I want to start learning ML and Core ML, and I have thought of a problem space that may be interesting that could use further exploration. I know NLP depends on extraordinarily large data sets, but I'm wondering about the utility of training a model on a constructed language with a much smaller data set. The one I have in mind has a very small official dictionary (slightly more than 100 official words), and rather simple grammar rules. Are there resources you would recommend for exploring this specific application of ML, or any pitfalls I might want to keep in mind?

This depends somewhat on what sort of tasks and models you are interested in.

For example, for a classification task, the maxent classifier available through Create ML is not language-dependent. It should be able to take on classification tasks in an artificial language of this sort. Gazetteers are language-independent, so they would still be usable.

Our built-in embeddings are language-dependent, so they would not be of use here.

If you want to train your own embedding or language model using open-source tools, that probably would still require significant amounts of data, but perhaps not as much as with natural languages.

Language modeling techniques have recently been applied with some success to programming languages. If your rules are similar to the syntax rules of programming languages, you might consider using the sorts of parsing tools that are used for them, but that is really a different area than NLP.

Question page
We converted a Tensorflow image segmentation model to Core ML. We notice that we get different results when running this Core ML model on macOS (with Python3 and coremltools) versus on iOS. Predictions are way less accurate on iOS and we cannot explain why (even when setting computeUnits parameter to .cpuOnly).

Have you tried setting

compute_precision=ct.precision.FLOAT32

as described here?

That said, for the CPU compute unit I would expect the predictions to match between iOS and macOS. There could be differences with the "all" compute unit depending on the actual hardware the model runs on, which can differ between Mac and iOS devices. If you could file a feedback request with the code to reproduce the differences you are observing, that would be great!

Question page
What would be the best way to figure out which objects go together - say you have 10 groups of three and a pool of 100 ungrouped objects & you want to group them similarly?

Thanks for asking! It would be helpful to clarify what the "objects" are in this context. If they are objects within image data, you could leverage the Create ML feature extractor to turn it into structured tabular data.

From there, it's a classical unsupervised clustering problem, for which there are several approaches. Something like k-means is a quick and effective approach that might work in your case.

https://apple.github.io/turicreate/docs/userguide/clustering/kmeans.html

There is also a CIKMeans filter in Core Image. That may be exactly what you need.

Question page
Do you know if it is possible to have a layer-wise execution time profiling with XCode 14 for the operations that run on the Neural Engine or GPU?

The new performance reports give a layer-wise break down of compute unit support, but not execution time. Please consider filing some feedback at:

http://feedbackassistant.apple.com

Question page
Do you have some method to share with us to benefit from sparse weight features (very nice features) without sacrificing the applicative performances?

The sparsity weight feature will be useful for models that have been trained with “weight pruning” techniques

Question page
Core ML Performance Report is great, but I can't find per-layer performance stats to find bottlenecks in our model.

The new performance reports offer per-layer compute unit support, but not per-layer timings. One further step you can take to find bottlenecks in the model is to press the "Open in Instruments" button where you can see further details in the Core ML Instrument. This won't offer per layer timing details, but it can help find bottlenecks related to data operations and compute unit changes.

Question page
Is there a way to run performance report on older versions of iOS? I suspect new compiler runs model differently than the Xcode 13 one.

Performance reports require iOS 16. Older iOS versions unfortunately cannot provide the same information.

Question page
What is a good approach in case of image classification problem. I am trying to classify two similar shapes - let's say a circle and an oval, in some case the confidence for the oval is very high for the circle input.

Have you looked at VNDetectContoursRequest? Using this traditional computer vision approach might give you better results.
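
A minimal sketch of the contour-based approach: detect contours, then inspect each contour's geometry (for example how circular it is) instead of relying on a classifier's confidence:

import Vision
import CoreGraphics

func detectContours(in cgImage: CGImage) throws -> VNContoursObservation? {
    let request = VNDetectContoursRequest()
    request.contrastAdjustment = 1.5
    request.detectsDarkOnLight = true

    try VNImageRequestHandler(cgImage: cgImage).perform([request])
    return request.results?.first
}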

Question page
Vision question: does a VNRectangleObservation contain information about the shape's full outline? For example: a document scanner that needs to fully extract a page from its background. VNDetectRectanglesRequest will provide the position of each corner of the page, which allows us to clip the shape assuming the page is flat. But if the paper is curled, we can end up with bits of background in our cropped image. Is there a way to trace an accurate outline for imperfect rectangles?

For imperfect documents, you might want to look at VNDetectDocumentSegmentationRequest and then combine it with contour detection on the globalSegmentationMask that you get as a result.

https://developer.apple.com/documentation/vision/vndetectedobjectobservation/3798796-globalsegmentationmask

It is a low-res pixel buffer that represents the shape of the detected document, where each pixel represents the confidence of being or not being part of the document.

Document segmentation is ML based and trained on all kinds of documents, labels, papers, etc. Rectangle detection is a traditional algorithm that works on edges that intersect to form a quad.

Question page
There are a lot of CoreMLCompiler versions throughout the Xcode history. Some break inference (e.g. some of the Xcode 13 coremlcompilers broke the iOS 14 runtime). Is there a way to diagnose these errors without compiling under every compiler and running on all iOS devices? And is there a known stable version of the compiler?

Maintaining backward compatibility is really important for us. We would like to understand this better. Could you file a bug report on:

feedbackassistant.apple.com?

If it works for you, could you please set up a 1:1 lab with Apple engineers so we can understand the issue better?

Question page
May I ask if there is some functionality to enable to recognize the direction (arrow) of time in a video?

Interesting problem! You may find understanding the apparent motion of pixels in the image a useful input. You can compute optical flow using Vision VNGenerateOpticalFlowRequest
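
A minimal sketch of the optical flow request between two consecutive frames; the result is a pixel buffer of per-pixel flow vectors you can then analyze (which frame you treat as the reference is up to your use case):

import Vision
import CoreGraphics

func opticalFlow(from previousFrame: CGImage, to currentFrame: CGImage) throws -> CVPixelBuffer? {
    let request = VNGenerateOpticalFlowRequest(targetedCGImage: previousFrame, options: [:])
    request.computationAccuracy = .medium

    try VNImageRequestHandler(cgImage: currentFrame).perform([request])
    return request.results?.first?.pixelBuffer
}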

Question page
I was surprised that the session video did not discuss (unless I missed it, which is definitely possible) how to write custom components. Let's say I want to write my own PoseSelector component. For example to pick the person closest to the center of the frame, and to keep the selection consistent across frames in a video. Can I? And if yes, how?

Hello! Thanks for your question. You can definitely build your own custom components. In the session Get to know Create ML Components, Alejandro provides an example of how to do it by building a saliency transformer.

All you need to do is to conform to the Transformer protocol!

The only required method for you to implement is applied:

https://developer.apple.com/documentation/createmlcomponents/transformer/applied(to:eventhandler:)-38h86

Search for Required on the page
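
As a hedged sketch of such a component: the SimplePose type below is defined purely for illustration (in a real pipeline you would use the pose type produced by the preceding component), and the method signature follows the requirement linked above:

import CreateMLComponents
import CoreGraphics
import Foundation

struct SimplePose: Sendable {
    var boundingBox: CGRect  // normalized coordinates
}

struct CenterPoseSelector: Transformer {
    // Pick the pose whose bounding box center is closest to the frame center.
    func applied(to input: [SimplePose], eventHandler: EventHandler?) async throws -> SimplePose? {
        let center = CGPoint(x: 0.5, y: 0.5)
        return input.min {
            hypot($0.boundingBox.midX - center.x, $0.boundingBox.midY - center.y) <
            hypot($1.boundingBox.midX - center.x, $1.boundingBox.midY - center.y)
        }
    }
}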

Question page
In addition to the built-in classifiers and regressors, is it possible to specify a neural network with a custom structure (by specifying the layers) and use that for training and inference?

You can use a FullyConnectedNetworkClassifier or regressor, which support specifying the number of hidden layers and the number of hidden units in each layer.

There are ReLUs on every hidden layer. Other activation functions are not available.

Other network architectures would require using Metal or Accelerate

If you have a use case please file a ticket in Feedback Assistant.

Question page
Does training (using the fitted or update methods) take advantage of the Neural Engine, for example when training a fully connected classifier?

Not on the Neural Engine, but training of a fully connected network is optimized for the best available compute unit.

Question page
When running model.predict(), I get "Error in declaring network." What does that mean?

This indicates that there is an error setting up the model. Can you please file a feedback report with code to reproduce the issue? Also can you verify if this issue reproduces with different compute unit options used when loading the model?

Question page
How do I go about training a dance classifier with with video files? Are there any components for audio, video to get started with?

You can definitely build a dance classifier using the action classifier in Create ML. See the following session from WWDC 2020:

Build an Action Classifier with Create ML

Question page
With the initializers for MLImageClassifier.ModelParameters being deprecated, what is the easiest way of increasing the iterations being performed?

MLImageClassifier.ModelParameters still has the

init(validation: ValidationData,
     maxIterations: Int,
     augmentation: ImageAugmentationOptions, 
     algorithm: ModelAlgorithmType)

initializer, which you can use to set the maxIterations along with other parameters that you want.

There are a few other initializers that were deprecated so if you are trying to use one of those old ones to set the maxIterations, you will get this warning.

Note that this particular initializer has no default value for augmentation. So, when you call this initializer you need to pass the augmentation parameter as well. If you don't want any augmentation, you can just pass an empty set of options for this parameter. If you do not specify augmentation in the init at all, one of the old initializers will be used and hence you will get the deprecation warning.

Question page
Is there a way to use one IOSurface for both ANE and GPU work? Or access ANE IOSurface directly, and map it to MTLTexture by hand?

IOSurface-backed CVPixelBuffer with OneComponent16Half pixel format type can be shared with Neural Engine without copy. Likewise, MLMultiArray which is backed by the pixel buffer can also be shared without copy. (See MLMultiArray.init(pixelBuffer:shape:)).

For input features, using these objects in MLFeatureValue is enough to take advantage of the efficient data processing. When the output feature type matches the type above, Core ML automatically uses these objects in the output feature values as well.

For output features, you can even request the Neural Engine to write into your own buffer directly. See MLPredictionOptions.outputBackings.

You can create MTLTexture view into the pixel buffer with CVMetalTextureCache.

Question page
Is there any chance you will add support for at least 4 channels no-copy buffers?

One way is to stack each channel on the height axis and use that big pixel buffer as the backing of an MLMultiArray. So:

// size = (width, height * 4)
// format = kCVPixelFormatType_OneComponent16Half
let backingPixelBuffer = ...
let multiArray = MLMultiArray(pixelBuffer: backingPixelBuffer, 
                              shape: [4, height, width])

This won't work if your image representation is a so-called "packed" format, where the channels are interleaved in the frame buffer. We would appreciate a Feedback Assistant report with a use case if that's what you are looking for.

Question page
Are float16 MLMultiArray no-copy also? Will there be a copy event if I specify user-allocated MLMultiArray with float16 data into outputBackings?

The Neural Engine is unable to write into MLMultiArray directly unless it was initialized with an IOSurface-backed pixel buffer.

MLMultiArray.init(pixelBuffer:shape:)

Question page
Is direct writing to IOSurface-backed buffers compatible with flexible shape input/outputs?

As the application needs to prepare the output buffer in the outputBackings property, it must know the shape of the output before invoking the inference. Some flexible-shape models (e.g. enumerated input shapes) meet this criterion, but so-called "dynamic shape" models, where the output shape is dynamically determined by the network, won't work.

Question page
As far as I understand, extracting text from images is not possible for Arabic language, would it be possible to use Create ML to achieve the same effect that is built in to extract Arabic text from images and documents?

You are right about Arabic support. While Apple announced more language support for Live Text this year, Arabic was not one of them. The complete list is here:

https://www.apple.com/ios/feature-availability/#live-text-live-text

It's not possible to extend Live Text to add additional user languages at this time.

To build your own system would require solving multiple ML problems, including locating text in the image, decomposing it into graphemes (characters), and to be robust it should probably include some sort of spelling/grammar layer to reduce transcription errors.

Building a complete solution like this is beyond what Create ML is designed for today.

Question page
Is there any guidance on using Create ML for creating segmentation models (great for use in ARKit to segment unique types of objects as detected in the camera feed)? Or would this be a case for building a custom Turi/other model and converting to Core ML?

The first thing I would try is the DeepLabV3 model, which can be found here:

https://developer.apple.com/machine-learning/models

You can also convert your custom model using coremltools

Question page
In the "Get to know Create ML Components" session around the 14m mark, the presenter mentioned that the augmentation applied is only used during training, not validation. Is that really true, given that it was just applied using flatMap() to the combined dataset in the code shown? It is not what I would expect based on reading the code.

It applies for training and validation data, but only at training time, not when doing predictions.

Question page
Is there a way create a federated learning/training solution. I'm planning to create on-device training and since the data is private I'd want to update the model while respecting user privacy and have all users benefit from a new model I can redistribute in updates. Is there a way to achieve this? Especially while respecting user data privacy.

Apple doesn't offer an out-of-box solution for federated learning at the moment. The differential privacy team at Apple published this article to discuss how to think about this problem:

https://machinelearning.apple.com/research/learning-with-privacy-at-scale

Question page
When building an Object Detection model, do you have any specific tools or recommendations on how to best annotate objects? There are a lot of tools out there, but they often feel cumbersome to use in comparison to the ease of Create ML.

Check out

https://developer.apple.com/documentation/createml/building-an-object-detector-data-source

for documentation on how to structure annotated data for object detection. Apple doesn't have a tool we can provide to create the annotations, and I agree it can be a bit of a cumbersome process. Hope that helps!


Some options suggested by non-Apple developers:

  • Roboflow - though if their public plan doesn’t work for you, then it is quite expensive
  • Label Studio - it’s open source
  • CVAT - there is a hosted, free version available at cvat.org, but you can self-host
Question page
Do you have any code snippets showing how to load a stereo audio file into MLMultiArray object?

It depends on the desired buffer layout of PCM data (interleaved or separate, int16 or float32, etc). A good starting point can be:

If you already have a buffer with the desired layout, you can use MLMultiArray.init(dataPointer:shape:dataType:strides:deallocator:) or MLShapedArray.init(bytesNoCopy:shape:strides:deallocator:). If you need to copy with some data type conversion (e.g. short to float32), MLShapedArray.init(unsafeUninitializedShape:initializingWith:) would work.

AAC audio, on the other hand, is often decoded into a sequence of audio chunks that are not aligned to one-second boundaries, so we would need to do some (tedious) buffer munching. The following code loads an AAC file into an MLShapedArray for every one second of audio and writes each back to a new AAC file.

MLShapedArray is a Swift-y cousin of MLMultiArray and, if you are using Swift, it is preferred over MLMultiArray. Core ML accepts either type. MLMultiArray(_ shapedArray:) and MLShapedArray(_ multiArray:) can convert between them.

import AVFoundation
import CoreML

let audioFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, 
                                sampleRate: 44100, 
                                channels: 2, 
                                interleaved: false)!

let frameCount = AVAudioFrameCount(audioFormat.sampleRate)

let inputURL = URL(filePath: "/Users/apple/sample.aac")
let sourceAudioFile = try! AVAudioFile(forReading: inputURL)
let sourceAudioBuffer = AVAudioPCMBuffer(
                            pcmFormat: audioFormat, 
                            frameCapacity: frameCount
                        )!

let aacSettings = [AVFormatIDKey : kAudioFormatMPEG4AAC,
                   AVSampleRateKey : 44100,
                   AVNumberOfChannelsKey : 2]

let outputURL = URL(filePath: "/Users/apple/output.aac")
let outputAudioFile = try! AVAudioFile(forWriting: outputURL, 
                                       settings: aacSettings)

// Loop to read and decode source audio file.
while sourceAudioFile.framePosition < sourceAudioFile.length {
    try! sourceAudioFile.read(into: sourceAudioBuffer)
    let frameLength = Int(sourceAudioBuffer.frameLength)

    // Make MLShapedArray from the audio buffer.
    let leftChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![0], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let rightChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![1], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let audioShapedArr = MLShapedArray(
        concatenating: [leftChannels, rightChannels], 
        alongAxis: 0
    )

    // Write the MLShapedArray back to a audio buffer.
    let outputAudioBuffer = AVAudioPCMBuffer(
        pcmFormat: audioFormat, 
        frameCapacity: sourceAudioBuffer.frameLength
    )!
    audioShapedArr[0].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![0].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    audioShapedArr[1].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![1].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    outputAudioBuffer.frameLength = sourceAudioBuffer.frameLength

    // And encode and write to an AAC file.
    try! outputAudioFile.write(from: outputAudioBuffer)
}
Question page
What image-sharpening torch model did you use in the "Optimize your Core ML usage" talk?

For the image sharpening model, we started with this Super Resolution PyTorch model and then made some custom modifications to fit our use case before converting to Core ML format using coremltools.

Question page
I'd like to classify larger article-sized bodies of text. One of the ways I'm working on doing this is by doing text classification. Given 5 categories of diary entry (eg family, health, spiritual, work, recreational), would it be preferred to use a single model that labels text with one of the 5? Or should I follow the SentimentClassifier example and use 5 separate models that each classify a string in 3 ways (notFamily, neutralFamily, isFamily)? If the latter, is this a use case for components?

Usually what I would tend to try first would be a single classifier model that labels text with one of your classes. That should be efficient and robust, and scale reasonably well with the number of classes. If you train multiple models, you would need to run each of them on each article you want to classify.

However, the questions being asked are subtly different in the two different cases. With a single model, you are asking which category or categories best match a given document. When you use separate models, you are asking whether a given document relates to a given category or not. The question you want to ask informs everything from your annotation to the type of model you use to the way you present your results.

Ultimately you will have to decide what analysis is best suited to your particular application.

Question page
Are there any limitations on number of IOSurface-backed buffers in a model?

No, Core ML doesn’t put any limitations.

Question page
Can the live text selected automatically (in a designated area) simply select all without user highlight?

Sorry, select all isn't an option currently. The user must manually select the text first. The only thing you can do is reset the selection if one exists.

Question page
Is it appropriate to try to use word embeddings to match long-form text up to single worded categories? For example, figuring out the distance between `"exercise"` and `"Today I decided to ride my bike to the store. I needed to get a workout in."` I'd like to match sentences and paragraphs up to to tags.

The most robust approach to this sort of categorization would be to pick a set of categories in advance, collect training data, and train a classifier to classify sentences according to these categories.

If you need to handle words outside the originally chosen set of categories, you could then use word embeddings to find an existing category similar to the entered word.

If you aren't able to train a model, things get a bit trickier. You can use tools such as part-of-speech tagging to identify relevant words in a sentence, e.g. nouns in the example you give, and determine how similar those are to the word you are trying to match. You would then need to figure out some way to take scores for individual words and form a score for an entire sentence.

Overall I think you would get better results by training a classifier, although it would require more work in advance for training.
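
For the embedding part, a small sketch with NLEmbedding, comparing a candidate word from the text against one of your category words:

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    let distance = embedding.distance(between: "workout", and: "exercise",
                                      distanceType: .cosine)
    print("cosine distance:", distance)  // smaller means more similar
}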

Question page
Is it possible to have flexible shape (enumerated) inputs (and therefore outputs) to be compatible with outputBackings and IOSurface-backed MultiArray?

Yes this will work as long as the output backing buffer is the correct size corresponding to the size of the input.

One note is that being able to avoid data copies during inference for models with flexible shapes will vary depending on circumstances. You can use the Core ML Instrument and look in the Data lane to see if data copies are occurring.

Question page
Does Core ML have everything necessary to perform keyword extraction? How would you go about extracting keywords from articles of text?

Natural Language has a number of tools that can be useful in keyword extraction: tokenization, part-of-speech tagging, named entity recognition, gazetteers that could be used to identify stop words, and so on.

We don't provide an implementation of a specific keyword or keyphrase extraction algorithm, but there are algorithms that are sometimes used that take into account features such as frequency, co-occurrence statistics, TF-IDF, etc. that can be calculated from text that has been tokenized and processed using some of these tools.

Doing this fully unsupervised is a difficult task, though. You might be able to do better if you have some advance knowledge of the vocabulary that is relevant to the sort of text you will be working with.
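
For example, here is a small sketch that uses NLTagger to pull candidate keywords (nouns) out of a passage; a real keyphrase extractor would layer frequency or TF-IDF scoring on top of this:

import NaturalLanguage

let text = "Natural Language has a number of tools that can be useful in keyword extraction."
let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text

var nouns: [String] = []
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: [.omitPunctuation, .omitWhitespace]) { tag, range in
    if tag == .noun { nouns.append(String(text[range])) }
    return true
}
print(nouns)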

Question page
Can optical flow be used in situations where more than one object is moving at the same time?

Yes, optical flow output is per pixel. Motion information will be returned for all parts of the image, and therefore for all moving objects in the scene.

Question page
Is there a document from that talks about how ML development works with Apple products and what is needed to get started?

A good starting point is to check out an overview of Apple’s ML focused development tools here:

https://developer.apple.com/machine-learning/

There are also some past WWDC videos which show you an example journey from idea to implementation such as this talk: Creating Great Apps Using Core ML and ARKit.

I highly recommend checking out this session on Friday: Explore the machine learning development experience

Question page
How do I handle situations where older ANE versions might not support certain layers and it will result in cpuAndNeuralEngine config being extremely slower on some devices?

MLComputeUnits.all is the default option and we recommend using that in most of the cases.

Core ML tries to optimize for latency while utilizing all the available compute units. MLComputeUnits.cpuAndNeuralEngine is helpful when your app is using GPU for pre or post processing and would like Core ML to not dispatch the model on GPU. Other than that MLComputeUnits.cpuAndNeuralEngine behaves very similar to MLComputeUnits.all.

If you have a model that is running much slower on certain devices, we recommend filing some feedback at http://feedbackassistant.apple.com with the model and specific device(s) included.
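
Setting the compute units is a one-liner on the model configuration (MyModel is a placeholder for your generated model class):

import CoreML

let configuration = MLModelConfiguration()
configuration.computeUnits = .all  // or .cpuAndNeuralEngine to keep the GPU free
let model = try MyModel(configuration: configuration)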

Question page
What is the difference between MLTrainingSessionParameters and MLObjectDetector.ModelParameters?

MLTrainingSessionParameters is for the async training API, e.g., train(), to specify training-related parameters such as the checkpoint saving location, whereas MLObjectDetector.ModelParameters is for both sync and async training to specify model-specific parameters.

Question page
We can use Shipment Tracking Number, URL as a source for live text. Can we define our own source for the live text? Let’s say I wanna add detection of new couriers other than FedEx or UPS.

For the DataScannerViewController? Sorry, right now there's just the one option for shipment tracking numbers, and it's whatever carriers we're able to detect.

Question page
With VNDocumentCameraViewController, is it possible to limit it to just one scan, so that the user doesn't have to press "Save" at the end?

No, sorry. Would appreciate a Feedback for an enhancement request, though

Question page
I am looking to detect or classify a jersey number from a sporting event such as hockey in a video. I have tried VNRecognizeTextRequest but do not get good results. Is there a better way to do such a task? Would I be better off creating my own model for this?

The results should have improved a bit using Revision3 of the VNRecognizeTextRequest. You could try that first.

Or you could train a custom classifier, but that requires loads of images to get good results.

When text gets deformed on fabric or obscured it gets very difficult to read.

Question page
Clarifying question on what's new in Vision. I think v3 brings improved face recognition & barcode recognition & previews for those. Optical flow is entirely new, and the UI for text recognition through video is entirely new. Do I have this right? Anything else new in Vision?

Optical flow is not entirely new, there was already a prior revision 1 for optical flow.

You are correct about barcode, but face recognition is not offered by Vision.

You are also correct that the UI for text recognition is new.

Other things new in Vision are a new text recognition revision, and the new functionality in Xcode for Quick Look Preview support. We also deprecated older face detection and face landmarks revisions.

Question page
Up until iOS 15, the rectangle tracking VNTrackRectangleRequest returned precise corners of tracked rectangle. Since iOS 15, it seems to only return the bounding box. This is present even in the original detection/tracking demos. What is the suggested way to get tracked rectangles (and also support the original Vision framework to support iOS 11+)?

Have you tried the iOS16 beta?

Question page
What's the best way to create a "silhouette" video as opposed to a silhouette photo? Would Optical flow be best for this or sampling every frame for a silhouette or...

It depends! The two key considerations are:

  • how expensive is it to generate a silhouette a priori for every frame? If that’s cheap enough, it might be simpler and better to do that;
  • on the other hand, optical flow can help in frame-to-frame stability.

It’s really going to depend on surrounding context and performance requirements (both latency and accuracy).

Question page
What is your recommendation for using DataScannerViewController to detect money/currency values? DataScannerViewController.TextContentType does not appear to support money, currencies, or generic numbers (see FB10139138). The iOS Camera app supports money/currency detection in iOS 16. What is the best practice for me to implement a similar feature in my app? Should I recognize all text and then parse each recognized text item myself to determine if the string value contains number or currency amount?

Ah, good enhancement request. Currently we're not supporting currency.

You might be able to detect the presence of currency with UIDataDetectors, but you won't be able to highlight them.

Another option is to use capturePhoto() to take a still then use the Live Text APIs. That'll highlight all the data detector elements, not just money.

You could also add some business logic on top of Vision text recognition to dial in on numbers only, specific (relative) text size or even position of the text in the rectangle of the currency.

Question page
What is the difference between optical flow and a VNTrajectoryRequest? Would tracking a trajectory of an object benefit from a work flow that used both?

The trajectory request is specifically designed to track objects moving along a trajectory, meaning not an arbitrary zig-zag path. Optical flow will detect any motion, without the constraint of a trajectory.

So if you want to track a ball being thrown, you will get better results with the trajectory request.

If you want to see whether something moved in, for instance, security camera footage, then use optical flow.

Question page
V3 extends VNRecognizeTextRequest with automaticallyDetectsLanguage - If I turn this on, how do I discover what language it decided to use?

Vision will not tell you which languages have been detected. The intent of this is to allow the client to give a "hint" to the algorithm.

If you already know the language up-front, it's best to specify that language explicitly, which allows the framework to target that language for better accuracy. If you do not, it's better to set automaticallyDetectsLanguage to true, which essentially is communicating to the framework "I don't know which language" and the framework will do its best to decode any language.

You can use NLLanguageRecognizer to detect the dominant language after the text has been extracted by Vision.

Usually a sentence is sufficient to identify language. You can pass in as much as you like, but the algorithm limits the amount of text it will consider. Less than maybe 5-10 words is challenging.

If you have some prior information as to what the language might be, you can also pass hints and/or constraints to NLLanguageRecognizer.
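
For reference, a minimal sketch of the NLLanguageRecognizer step; the hint weights are purely illustrative assumptions.

import NaturalLanguage

// Identify the dominant language of text already extracted by VNRecognizeTextRequest.
func dominantLanguage(of recognizedText: String) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()
    recognizer.languageHints = [.english: 0.7, .french: 0.3]   // optional prior knowledge
    recognizer.processString(recognizedText)
    return recognizer.dominantLanguage
}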

Question page
How different are VNRecognizedTextObservations (returned by VNRecognizeTextRequest) to the RecognizedItem array returned by DataScannerViewController? Do they have the same information in them? Also, is the DataScannerViewController using the same VNRecognizeTextRequest (with revision3) in the background to process the results?

RecognizedItem contains a lot of the same information, like the transcript and corners (RecognizedItem corners are in view coordinates, however), and RecognizedItem exposes the related Vision observation.

RecognizedItem, however, learns over time: the longer we see a text group, the more accurate the transcript will be. The Vision observation that is exposed is based only on the last frame processed.

I cannot state which revision it uses, if any (sorry to be vague), but DataScanner supports the same languages as VNRecognizeTextRequest.

Question page
Can VNRecognizeTextRequest be used to perform text recognition on images with handwritten content or should it only be used for typed text (or very close to typed)?

Yes it can to some degree. It won't read my bad handwriting for sure but others will work. Handwriting is of course so varied that it really depends on the person who writes it.

Question page
Is live text powered by Vision APIs?

Live text is using Vision for its recognition work.

Question page
I saw on another thread that there is a new revision of VNRecognizeTextRequest (v3), but I can’t find it in the documentation: how can I enable it and can I “force” Vision to use a “minimum revision” (for example, 3 and later)?

From a non-Apple developer:

I handle code from Beta SDKs like this:

#if defined(MAC_OS_VERSION_13_0) && MAC_OS_X_VERSION_MAX_ALLOWED >= MAC_OS_VERSION_13_0
#warning "if you can see this, it's time to remove the #if."
    if (@available(macOS 13.0, *))
    {
        revision = VNRecognizeTextRequestRevision3;
    }
#endif
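
In Swift, a hedged equivalent is to opt in to the newest revision the running OS supports; the availability versions below are assumptions based on when revision 3 was introduced.

import Vision

// A minimal sketch: prefer revision 3 where available, otherwise use the
// newest revision this OS reports as supported.
let request = VNRecognizeTextRequest()
if #available(iOS 16.0, macOS 13.0, *) {
    request.revision = VNRecognizeTextRequestRevision3
} else if let latest = VNRecognizeTextRequest.supportedRevisions.max() {
    request.revision = latest
}
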
Question page
Last year you introduced the VNGeneratePersonSegmentationRequest. I know you can't comment on future plans, but it would be amazing if the new pet / object segmentation of iOS 16 was available to developers

It is always good to file feedback and explain what you are looking for.

Question page
Live Text seems to be added to UIImageView via `.addInteraction(<ImageAnalysisInteraction>)`. Is there a way to add this interaction to SwiftUI's `Image`?

I don’t believe so. You can probably wrap a UIImageView in a ViewRepresentable - but of course it wouldn’t be a SwiftUI image anymore. I think we might have sample code that does something similar in the State of the Union donut app

Here's the relevant part in the Platforms State of the Union video.

But yeah, we definitely need SwiftUI support for the new VisionKit APIs 🙂 feedbacks may help
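
Until then, a hedged sketch of the UIViewRepresentable approach might look like this; the text-only configuration and the view setup are assumptions, not the sample code from the talk.

import SwiftUI
import UIKit
import VisionKit

// A minimal sketch: wrap a UIImageView with an ImageAnalysisInteraction for SwiftUI.
@available(iOS 16.0, *)
struct LiveTextImageView: UIViewRepresentable {
    let image: UIImage

    final class Coordinator {
        let interaction = ImageAnalysisInteraction()
        let analyzer = ImageAnalyzer()
    }

    func makeCoordinator() -> Coordinator { Coordinator() }

    func makeUIView(context: Context) -> UIImageView {
        let view = UIImageView(image: image)
        view.isUserInteractionEnabled = true
        view.addInteraction(context.coordinator.interaction)
        return view
    }

    func updateUIView(_ uiView: UIImageView, context: Context) {
        uiView.image = image
        let coordinator = context.coordinator
        Task { @MainActor in
            let configuration = ImageAnalyzer.Configuration([.text])
            if let analysis = try? await coordinator.analyzer.analyze(image, configuration: configuration) {
                coordinator.interaction.analysis = analysis
                coordinator.interaction.preferredInteractionTypes = .textSelection
            }
        }
    }
}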

Question page
I'm part of a team that is building an app that is wishing to identify and recognise faces in a collection of photos. At the moment I've had success with Photos/Vision framework to find faces in photos and isolate them, but we're currently then sending those faces to AWS Amazon Rekognition service to help compare the face to a set of others and associate them to an existing face, or create a new face model. If I wanted to move this type of modeling onto the device itself (rather going through a network request to a 3rd party service), could you possibly guide me where to start? I'm assuming I could do the same thing locally on device using Apple frameworks?

We do not offer on-device face recognition solutions.

Generally speaking you would need to either find (or train, if you have the data and the know-how) a face recognition model, which could then be run on-device through Core ML once converted into that format. Often such models return some descriptor, which can be compared to other similar descriptors to provide a distance. How best to measure that distance is often tied in to how the face recognition model was trained.

You may file a Feedback Assistant request if you'd like Apple to offer face recognition in the future.

Question page
We currently use a CoreML model with a C+ [sic] framework to handle initialization parameters in our processing queue (how long to hold an object, time an object should be in frame etc) and then run the ML model on the image captured with those parameters. Is Vision a better alternative than running our own initializers like that? Can we specify with Vision the retention time of images for processing images asynchronously? What is best practice there? Thank you!

Not sure about C+ [sic] in terms of its retention, but as long as you hold a VNImageRequestHandler, the image will be held.

Question page
My app iterates over the user's entire photo library using VNDetectHumanRectanglesRequest and VNRecognizeAnimalsRequest, in order to find all the photos containing humans and pets. For performance reasons, I'm only loading a small version of the photo. I've noticed that this (obviously) affects the results. Is there a recommended image size when using these requests? I'd also appreciate any other ideas on how to optimize the performance for such a task.

There is no hard and fast size that works for everything. The reason is that detection is limited by the ratio of the dog or human with respect to the image. So it depends on your use case, for instance whether you want to find a small dog in the background of a large panorama.

Question page
Is it possible to have player (end-user) enabled Machine Learning? For example in my game Follow the White Rabbit it would be helpful to adjust the model. For example supporting different hand sizes, skin tones, as well as support hands that had more/less than the standard number of fingers.

Yes, you can adapt a model on-device using any one of our ML frameworks, including Core ML, Create ML Components, MPSGraph, and BNNS. The approach you take depends on the data and problem you are working with.

To detect hand poses, I recommend checking out the sample code project Detecting Hand Poses with Vision.

If you foresee training on a small dataset, then it might be worth looking into using the KNN algorithm available in Core ML, check out the sample code project Personalizing a Model with On-Device Updates to learn more.

Finally, it is worth browsing through the documentation for the newly released API, Create ML Components.

Question page
Is the sample code for human action repetition counting available?
I tried the sample code on a 6th gen. iPad Mini doing 20 jumping jacks at VARYING paces. I found delays and missing counts. The first few jumping jacks were always not being counted. I’m guessing the hard-coded stride of 5 and length of 90 used for the sliding window transformer may be the culprit. To me there isn’t a correct set of numbers to use because I have no control over how fast or slow my users will do their jumping jacks. Please advise.

A few things you may further check:

  • frame rate - the sample app should print out the frame rate as debug information in the Xcode console. Please check if it is roughly 30fps. If not, try to improve the environment lighting, charge the device, etc. and see if it improves.
  • body pose - please check if a single person’s full body pose is in the middle of the screen, and while the person moves, check if the poses are accurate (e.g., no missing joints or joints jumping everywhere, no visible delays of pose tracking, etc.)
  • ignored joints in JointsSelector - the initial setting here has 5 joints ignored (to demonstrate this transformer's purpose). You may remove them if you are interested in the full body and all joints.
  • stride - determines how often (in terms of frame count) the counter is refreshed, and it can be set to other numbers. The length of 90, however, should not be changed. This is fixed for the model.
  • Downsampler transformer has a factor of 1 - which works best for actions close to ~1s per repetition. It can tolerate varying speed to some extent. If your targeted action is typically much slower, you may set the factor to 2 or another number, though this may increase the counter delay too. Unfortunately, you have to manually set it in the sample app for now.
  • You are also free to change some of the other logic in the sample app, such as how uiCount is rounded and reset, etc.

Question page
Correct me if I’m wrong. 1. Virtual HIIT fitness coach is not a good app idea for today. Action classifier can’t classify actions fast and accurately enough on mobile devices today? 2. The model was trained using 30 fps and a prediction window size of 90 frames under the assumption that each human body action lasts about 3 seconds?

  1. Action classifier is a model template that needs to be trained. So it is good with fitness actions that it was trained with, such as jumping jacks, squats, and some HIIT actions. Depending on your specific needs, we may further talk about topics such as how fast the actions could be and how accurate it is. Some resources are also here:
  2. This model is a separate model for counting actions (not the action classifier from WWDC20). It is class-agnostic, exposed via our API, and does not need to be trained. It was trained with 30fps videos, and the window size is 90 frames. But this is a completely different model; the window size of 90 isn’t the same concept as the action classifier’s window size. Within these 90 frames, multiple completed actions are OK (e.g., best with 2~4 action repetitions captured within the window). If you have videos or a camera feed other than 30fps, you could choose to downsample using the Downsampler transformer. If your video is 30fps but your targeted actions are quite slow, such as push-ups, you could choose to downsample too.

Question page
Is it possible to train the model generated by MLRecommender on device when new data is available?

MLRecommender does not support on-device training/updating. However, I suggest you check our WWDC21 session below to build personalized recommendation-like experience into your app:

https://developer.apple.com/videos/play/wwdc2021/10037/

Question page
The MetalFX team has presented a very nice (classical) method for video upscaling. What is the potential of using MPS to achieve machine learning upscaling?

MPSGraph supports most common neural-network machine learning layers and operations so you should be able to create an upscaling network from the basic components, but MPSGraph doesn't have prebuilt graphs or networks so you would need to investigate and research the network architecture yourself, train it (using MPSGraph or other training frameworks) and deploy on MPSGraph.

One benefit of using MPSGraph is that you can pretty easily incorporate other Metal kernels (for example MPS image processing kernels or your own kernels) and encode them to the same Metal CommandQueue (or MPSCommandBuffer) to achieve low-latency, often zero-copy execution between the pre/post-processing kernels and the MPSGraph segment(s).

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current Proto release is focused on functionality and we have not tuned the performance yet. Do look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Do update to latest and see if it still is giving bad utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the cpu? That hurts performance.
  4. What OS is it: 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
Is it possible to convert a PyTorch Text->Image model, such a VQGAN, to CoreML?

You can try using coremltools - https://coremltools.readme.io/docs/pytorch-conversion

I was trying to find something more specific for you, but couldn’t. Personally I haven’t tried VQGAN, but it seems like CLIP model can be converted. Here’s an issue that has been resolved regarding CLIP.

Question page
GPU acceleration and federated learning are two very appealing approaches for large scale training (or even training over the edge using multiple mobile devices). Is there some special provision in the MPSGraphs framework to enable/enhance such functionality?

MPSGraph should run just fine with iOS and iPadOS. There are no special pre-built functions that achieve techniques like PFL, but using for example the random-number generators provided by MPSGraph, one should be able to generate these operations from basic building-blocks.

Then as long as you can aggregate the gradients or other weight-updates across the network (something outside the scope of MPSGraph) you should be able to do this.

But again quite a bit of manual work is needed.

Question page
In "Accelerate machine learning with Metal" Drhuva referenced a new sample code for NeRFs at 14:08. But I can't find it anywhere :( P.S. Yaaay, NeRFs! :)
While using MPS backend in PyTorch, I've found out that there is no way to select a GPU. This feature would be really beneficial while running Mac Pro with multiple GPU's.

We currently don’t have multi-GPU support.

Question page
Great session on image colorization, Geppy. Do you have any examples of user-customizable hand tracking? Think magic spells.

Interestingly, Geppy is a super hero as well. You can watch him demonstrate his powers with hand pose and action classification at the end of this session from last year:

Classify hand poses and actions with Create ML

Question page
Does coremltools support conversion of PyTorch Text-> Image models like CLIP? VQGAN?

Both these models (CLIP and VQGAN) are based on CNNs and transformer architectures, both of which should be supported.

In fact here is a resolved issue of a CLIP model conversion.

Note that, depending on the details, you may have to perform the pre-processing of the text input transformation to a tensor representation outside the PyTorch model given to the Core ML Tools convert API. The conversion operates on PyTorch models with tensor in tensor out interface.

I’d say just give the converter a try, and please take a look at some of the examples on the doc page and if you run into issues, post on the Github repo.

Question page
Hello! You mention searching for models in various "specialized" websites and such... do you have favorite places you've gone to find models?

Besides searching Github, you can use these sites:

Question page
In "Explore the machine learning development experience", you mentioned re-training a few candidate replacement models before model integration. What's your process for deciding how many to try?

I tried architectures from two other scientific publications too. But then I decided to “re-work” the architecture of the model I used in the session a bit and decided to go with that.

The process can be different from model to model.

Explore the machine learning development experience

Question page
Will the source code for the project in "Explore the machine learning development experience" be available? It would be helpful to be able to dig in and change some things up to really understand the flow.

The code is not available at this time. Your interest and request are appreciated and noted 🙂

Explore the machine learning development experience

Question page
Is it possible to dispatch a Core ML inference evaluation as part of a display or compute shader pipeline? Or do I need to wait for the CPU to be informed the frame has been rendered before dispatching from the CPU? Best of all would be if it could run on the ANE so that the GPU is free to work on the next frame.

If the output of the GPU is in an IOSurface (in Float16 format), you can feed that to Core ML and let the ANE work on it directly without any copies, but the CPU does get triggered today to synchronize these two computations.

Core ML doesn’t support MTLSharedEvent, if that’s what’s implied here.

Would you be able to file a feature request on feedbackassistant.apple.com with a little more details about your use case and may be some sample code on how you want to accomplish this? That would really help us push the API in the right direction.

Question page
Hi, I'm excited to see more information about optimizing recent models for Core ML including the `ane_transformers` repo. If I wanted to optimize eg CLIP for ANE, should I use code from that repo, or just try to take recommendations from the case study?

Yes I think using the code from the Apple Neural Engine (ANE) Transformers repo at the end of the Deploying Transformers on the Apple Neural Engine article is the best way to get started, and definitely follow along the recommendations of the article as well. Sounds like you are already on the right track!

The default conversion should be quite efficient as well for the neural engine (NE). With the new performance tab in Xcode 14, you will see whether the model is already neural engine resident or not.

There are details in the article on some of the changes specific to distilbert, which may or may not be required for the transformer architecture in CLIP.

There are more details to be found in the A Multi-Task Neural Architecture for On-Device Scene Analysis article as well, hope this helps.

In any case if you find any inefficiencies after conversion, feel free to share with us via a feedback request. We are constantly adding new converter and NE compiler optimizations to automatically detect patterns and map them efficiently to NE, so such feedback is very valuable!

Question page
In the "Optimize your Core ML usage" session, the presenter, Ben, explains that he got a latency of 22ms using the new performance metrics and that gives him a running frame rate of 45 frames per second. How did he come to that conclusion and how can I look at my performance metrics to determine our frames per second as well?

The number is just an upper bound estimate based on:

1000 ms / 22 ms ≈ 45 predictions per second

Such estimates often help us to understand the amount of headroom we can use for other operations while meeting the real time requirement (30 fps, etc).

Question page
Is it possible to run a Core ML model in the cloud/on Linux? We are using a Core ML model to power privacy-preserving, on-device features. But we want to offer a web-based demo to potential users, since downloading an app can be higher friction than just using a website.

No, it is not supported. It is an interesting use case. A feedback assistant report will be much appreciated!

Question page
As far as I know, multi-label image classification is not possible with Create ML. Is it possible with Create ML Components to create a multi-label classifier?

One option is to implement your own custom estimator using a framework like MPSGraph. To simplify the task (and data required), you may want to explore training the classifier on the features produced by a feature extractor, such as ImageFeaturePrint.

Question page
Is there custom operation support for PyTorch?

To learn more about GPU acceleration for PyTorch and TensorFlow, please refer to Accelerate machine learning with Metal

Specifically, PyTorch is open sourced, so you can leverage this to implement custom operations in Metal. Custom ops are also supported for TensorFlow as outlined in the session.

Question page
Do you have an example of how the ML image style transfer was created from an earlier session?

You can learn about how Create ML can help you build style transfer models with an example integration in this session:

Build Image and Video Style Transfer models in Create ML

There are also a wide variety of style transfer models online that can be converted to Core ML format with coremltools.

The model takes in an image and outputs an image. I believe the app is streaming data from an AVCaptureSession and running each frame through the Core ML model and outputting the result properly scaled back to the original image size.

Question page
Does Core ML benefit from two ANE's in M1 Ultra?

Yes, when you are using multiple models or batch inference.

Question page
Accelerate
When will Create ML support Neural Networks?

Just to clarify, it does support neural networks. For instance FullyConnectedNetworkClassifier.

But if you wanted to create a custom network you would need to use Metal or Accelerate.

Question page
Action Classifier
When training a video action classifier model in Create ML, is it best to have only one person's poses in the frame (and crop out others)?

Yes.

If you have multiple people in the screen. Try to keep other people consistently smaller than the main person. Then it will still work (automatically select the maximum bounding box person)

Check out the video from WWDC 2020:

Build an Action Classifier with Create ML (at 24m21s)

When it comes to using the model in your applications, make sure to only select a single person. Your app may remind users to keep only one person in view when multiple people are detected, or you can implement your own selection logic to choose a person based on their size or location within the frame, and this can be achieved by using the coordinates from pose landmarks.

Question page
In the "What's new in Create ML" talk, near the end Repetition Counting was mentioned, and a reference to the "linked article and sample code", yet the WWDC22 Sample Code list does not include this, nor does the documentation, I believe. Can you point me to the Sample Repetition Count code and Documentation?
When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?

The window size is fixed once a model is trained. At prediction time, you may adjust something called stride, i.e., how often (in terms of number of frames) you create a new prediction window. The smaller the stride, the sooner you make the next prediction. This can make the results refresh faster.

There was a sample code from last year, called Detecting Human Actions in a Live Video Feed, you may check this as a reference

Your app - and the model - need to balance three concerns: Accuracy, Latency, and CPU Usage.

  • A shorter window size (selected at training time) may reduce run-time latency! But if the window is shorter than the longest event you are trying to recognise, then accuracy may be reduced
  • A shorter stride can also reduce latency, but results in more frequent calls to the model, which can increase energy use.
  • A longer stride reduces CPU usage, and allows other functions in your app such as drawing functions more time to work - which might help reduce any performance related bugs in your app, but will increase latency.

Also, try not to skip frames (unless you did the same for your training data); otherwise, the effective action speed captured in the window will change, and this may mess up the prediction result (accuracy).

Latency: is the time from when the action happens in the real world to when the app detects it and tells the user. This will be experienced as 'snappiness'. For action classification, latency is at least as long as the window size, plus a bit. It's an emergent parameter caused by other choices you make.

Stride: is the time the model waits in between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.

Latency can be slightly shorter than the window size, because the model can predict the action when it sees a partial action. That’s why a shorter stride can help to refresh the results faster.

You may also be able to reduce energy, memory and CPU usage by setting the camera to a lower resolution. If your app is very busy, making this change might improve app performance. This change should not reduce model accuracy, but your user's view of the scene will not be so sharp.

Another inference speed factor might be your device. A newer device with neural engine will help. A fully charged device, and operating in good lighting condition, etc. can also help to reduce the latency (improve the Vision pose detection speed, and camera frame rate)

We have a great session on understanding the performance of apps with machine learning models this year! Check it out:

Optimize your Core ML usage
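
To make the stride and window discussion above concrete, here is a minimal, hedged sketch of a sliding-window inference loop. ActionClassifier, its poses input, and its output property names are hypothetical stand-ins for your own Create ML-generated model class, and the stride value is just an example to tune.

import CoreML
import Vision

final class ActionPredictor {
    private let model: ActionClassifier   // hypothetical generated model class
    private let windowSize = 90           // fixed by the trained model
    private let stride = 15               // tune: smaller = snappier, more CPU
    private var poseWindow: [MLMultiArray] = []
    private var framesSinceLastPrediction = 0

    init(model: ActionClassifier) { self.model = model }

    // Call once per frame with the detected body pose; returns a label when a
    // new prediction is made, or nil while waiting for the next stride boundary.
    func process(_ observation: VNHumanBodyPoseObservation) throws -> String? {
        poseWindow.append(try observation.keypointsMultiArray())
        if poseWindow.count > windowSize { poseWindow.removeFirst() }
        framesSinceLastPrediction += 1

        guard poseWindow.count == windowSize,
              framesSinceLastPrediction >= stride else { return nil }
        framesSinceLastPrediction = 0

        // Merge the per-frame keypoint arrays into a single window-sized input.
        let input = MLMultiArray(concatenating: poseWindow, axis: 0, dataType: .float32)
        let output = try model.prediction(poses: input)
        return output.label
    }
}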

Question page
How do I go about training a dance classifier with with video files? Are there any components for audio, video to get started with?

You can definitely build a dance classifier using the action classifier in Create ML. See the following session from WWDC 2020:

Build an Action Classifier with Create ML

Question page
Audio
Would Create ML/Components/Core ML be capable of dealing with music data? Specifically I want to train a model that can predict the tempo (BPM - beats per minute) of a song.

Core ML models can support audio data through MultiArray inputs of audio samples. Create ML does support audio/sound classification but not tempo estimation.

Doing a quick search online it seems there is some work on models for tempo estimation, many of which could be converted to Core ML

Question page
Do you have any code snippets showing how to load a stereo audio file into MLMultiArray object?

It depends on the desired buffer layout of PCM data (interleaved or separate, int16 or float32, etc). A good starting point can be:

If you already have a buffer with the desired layout, you can use MLMultiArray.init(dataPointer:shape:dataType:strides:deallocator:) or MLShapedArray.init(bytesNoCopy:shape:strides:deallocator:). If you need to copy with some data type conversion (e.g. short to float32), MLShapedArray.init(unsafeUninitializedShape:initializingWith:) would work.

AAC audio, on the other hand, is often decoded to a sequence of audio chunks which is not aligned to one second boundary. So, we would need to do some (tedious) buffer munching. The following code loads an AAC file into MLShapedArray for every one second and write each back to a new AAC file.

MLShapedArray is a Swift-y cousin of MLMultiArray and, if you are using Swift, it is preferred over MLMultiArray. Core ML accepts either type. MLMultiArray(_ shapedArray:) and MLShapedArray(_ multiArray:) can convert between them.

import AVFoundation
import CoreML

let audioFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, 
                                sampleRate: 44100, 
                                channels: 2, 
                                interleaved: false)!

let frameCount = AVAudioFrameCount(audioFormat.sampleRate)

let inputURL = URL(filePath: "/Users/apple/sample.aac")
let sourceAudioFile = try! AVAudioFile(forReading: inputURL)
let sourceAudioBuffer = AVAudioPCMBuffer(
                            pcmFormat: audioFormat, 
                            frameCapacity: frameCount
                        )!

let aacSettings = [AVFormatIDKey : kAudioFormatMPEG4AAC,
                   AVSampleRateKey : 44100,
                   AVNumberOfChannelsKey : 2]

let outputURL = URL(filePath: "/Users/apple/output.aac")
let outputAudioFile = try! AVAudioFile(forWriting: outputURL, 
                                       settings: aacSettings)

// Loop to read and decode source audio file.
while sourceAudioFile.framePosition < sourceAudioFile.length {
    try! sourceAudioFile.read(into: sourceAudioBuffer)
    let frameLength = Int(sourceAudioBuffer.frameLength)

    // Make MLShapedArray from the audio buffer.
    let leftChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![0], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let rightChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![1], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let audioShapedArr = MLShapedArray(
        concatenating: [leftChannels, rightChannels], 
        alongAxis: 0
    )

    // Write the MLShapedArray back to a audio buffer.
    let outputAudioBuffer = AVAudioPCMBuffer(
        pcmFormat: audioFormat, 
        frameCapacity: sourceAudioBuffer.frameLength
    )!
    audioShapedArr[0].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![0].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    audioShapedArr[1].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![1].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    outputAudioBuffer.frameLength = sourceAudioBuffer.frameLength

    // And encode and write to an AAC file.
    try! outputAudioFile.write(from: outputAudioBuffer)
}
Question page
Backward Compatibility
There are a lot of CoreMLCompiler versions throughout the Xcode history. Some break inference (e.g. some of the Xcode 13 coremlcompilers broke iOS 14 runtime). Is there a way to diagnose these errors without compiling under every compiler and running on all iOS devices?) And is there a known stable version of a compiler?

Maintaining backward compatibility is really important for us. We would like to understand this better. Could you file a bug report on:

feedbackassistant.apple.com?

If it works for you, could you please set up a 1:1 lab with Apple engineers so we can understand the issue better?

Question page
Classifier
This may be a more general Machine Learning question. I am interested in the ability to extract text from images and video. The Vision framework does a great job of extracting the text. One thing I would like to do is determine whether each piece of text is in a particular category of typeface, largely looking to tell a source code / monospace font from a sans serif font. Which machine learning technologies available on Apple platforms would be best suited for that? And a high level of how you might approach that?

So the Vision framework, where you extract the text, tells you the region in the image where the text is; the first thing to do would be to crop the text out of the image.

If you have a binary image classifier (sans serif vs serif, or “looks like source code” vs “doesn’t look like source code”, it’s worth experimenting with what definition works best – and you’d need to collect samples of each for your training set!), you can then throw that crop to this classifier to work out whether it’s source code or not.

So at a high level, what I’d do is:

  • train a binary classifier to distinguish source-code from not-source-code
  • using Vision, crop out the region of the image with detected text in
  • use your classifier to determine whether it’s source code or not

and go from there!

Also you can try out the Live Text API this year, it's able to extract text of out images and videos. However it does not provide font-related information of the text yet. You can file a bug tracking this issue to us if needed.


From a non-Apple developer:

I created a serif/sans-serif model with CreateML. You can find it here: https://github.com/jmousseau/Mimeo

Question page
How would you recommend to approach the classification of various fine-grained subclasses of the same class? Specifically talking about different types of something made of paper. For example: "a postcard with something written on it", vs "an empty postcard" vs "just some piece of paper" vs "another object"? With a classifier model we were able to obtain very accurate results to distinguish "paper vs some other object". However we couldn't get accurate enough results (I think ~60% accuracy) regarding the more fine-grained decisions: "postcard vs some piece of paper" and "postcard with text vs empty postcard". The mistakes were usually into false-positive side (identifying some piece of paper as a postcard in my example). So how would you setup the training samples for this sort of goal? Or are we looking in the wrong option, and should be considering some other method, or a combination of methods instead?

Something you could try is doing a hierarchical approach where you first detect the overall class and then crop and do a more precise classification.

In addition to tuning the training, you can also tune the data you're training on, like you mention. Adding more examples (with breadth to encompass what you expect to see in practice) where you're getting the false positives might help iron out those edge cases. There are some data augmentation options in both the Create ML app and Create ML Components that could help grow your sample size and add some diversity to it without collecting monumentally more data.

Hard to say if it's feasible. But getting text definitely sounds better suited if your sub-classes are always text based.

Question page
Can someone clarify the difference between a classifier and regressor again for me?

A classifier is used to predict a discrete categorical value. Think of it as predicting an enum.

A regressor is used to predict a real value. Think of it as predicting a float or double.

In the first demo in Get to know Create ML Components a regressor was used to predict a ripeness value.

A classification approach to the same problem would require you to define categories like “green”, “not ripe”, “ripe”, “over ripe”. This is an option as well, but you would not be able to compare the ripeness of two examples that got classified into the same category.

How about run time efficiency between the above mentioned classifier and regressor?

They should have similar prediction compute time if you are using a common feature extractor. For image models that feature extraction step will likely dominate inference compute. LogisticRegressionClassifier and LinearRegressor will both be doing a similar size matrix multiplication behind the scenes, particularly if you restrict yourself to a few classes. As the number of classes increase the classifier will become slower.

Question page
Clustering
What would be the best way to figure out which objects go together - say you have 10 groups of three and a pool of 100 ungrouped objects & you want to group them similarly?

Thanks for asking! It would be helpful to clarify what the "objects" are in this context. If they are objects within image data, you could leverage the Create ML feature extractor to turn it into structured tabular data.

From there, it's a classical unsupervised clustering problem, for which there are several approaches. Something like k-means is a quick and effective approach that might work in your case.

https://apple.github.io/turicreate/docs/userguide/clustering/kmeans.html

There is also a CIKMeans filter in Core Image. That may be exactly what you need.

Question page
Color Matching
I'm new to ML. I would like to implement some sort of color matching with two photos (i.e. when superimposing a person on a different background, adjusting color, contrast, etc. to match the background better). Is something like that suited for Core ML (and if so, do you have any suggestions on how to approach that?), or would a simple algorithm be a better solution for those kinds of tasks?

Doesn't sound like something you can do with Core ML or Create ML. Try the Vision API Q/A on Thursday from 2-3 PM PT.

Question page
Complex Numbers
Are there any updates planned for coremltools, notably supporting complex numbers. I'm looking to convert a TensorFlow model, but even after implementing missing operations like Fast Fourier Transform, I'm blocked by the fact there is no complex number support.

Yes, you are right. There isn’t a complex number data type in the CoreML MIL spec.

The best way to use them is by treating complex numbers as 2D vectors of real and imaginary numbers.

Question page
Core Image
What would be the best way to figure out which objects go together - say you have 10 groups of three and a pool of 100 ungrouped objects & you want to group them similarly?

Thanks for asking! It would be helpful to clarify what the "objects" are in this context. If they are objects within image data, you could leverage the Create ML feature extractor to turn it into structured tabular data.

From there, it's a classical unsupervised clustering problem, for which there are several approaches. Something like k-means is a quick and effective approach that might work in your case.

https://apple.github.io/turicreate/docs/userguide/clustering/kmeans.html

There is also a CIKMeans filter in Core Image. That may be exactly what you need.

Question page
Core ML
Does an MLModel need to be adapted in any way to support predicting into buffers given with MLPredictionOptions.outputBackings?

MLModel does not need to be adapted in any way to accept output backings. MLPredictionOptions.outputBackings can be used to provide either a CVPixelBuffer or MLMultiArray depending on the output feature value type. When a CVPixelBuffer is provided as output backing for prediction, please ensure that the base address is not locked for read / write. Please check out Optimize your Core ML usage session tomorrow.
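
A minimal sketch of the output backing flow, assuming a model with a 512x512 Float16 image output named "output" (the model, sizes, and feature names here are hypothetical):

import CoreML
import CoreVideo

func predictIntoOwnBuffer(model: MLModel, input: MLFeatureProvider) throws -> CVPixelBuffer {
    var backing: CVPixelBuffer?
    let ioSurfaceProperties: [String: Any] = [:]
    CVPixelBufferCreate(kCFAllocatorDefault, 512, 512,
                        kCVPixelFormatType_OneComponent16Half,
                        [kCVPixelBufferIOSurfacePropertiesKey as String: ioSurfaceProperties] as CFDictionary,
                        &backing)

    let options = MLPredictionOptions()
    // Ask Core ML to write the "output" feature directly into our buffer.
    // Do not lock the buffer's base address while the prediction runs.
    options.outputBackings = ["output": backing!]

    _ = try model.prediction(from: input, options: options)
    return backing!
}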

Question page
There was a great WWDC video back in 2018 titled Vision with Core ML. The example app used live video capture to feed an image through the model using scene stability. There is sample code out there for UIKit, but always wanted to try and re-make using SwiftUI as a starting point. Any tips or pointers on making an image recognition app with live video capture using SwiftUI as a starting point ?

You have a couple choices here. You can use the techniques you already know, and use UIViewRepresentable to import the UIViews into your app. That will still work really well!

I've actually done exactly that using UIViewControllerRepresentable in the app I'm working on, and used a modified version of the VisionViewController which I think came from that video's sample code. It works perfectly... the entire app is SwiftUI except for the VisionViewController part.

Here's a link to the original sample code that has the VisionViewController (click the download button from that page): Training a Create ML Model to Classify Flowers. Unfortunately I don't have a publicly available version that shows where I've used that with UIViewControllerRepresentable.

Alternatively, you can display the camera feed by sending the image from each frame into a SwiftUI Image. The new Create ML Components feature actually includes a VideoReader.readCamera method which provides an asynchronous stream of image buffers, which is a great way to get started with this approach. Alternatively you can use your existing AVCaptureDevice logic and delegate methods to provide a series of images. You can see an example of this approach in the rep counting demo app which will be available soon as part of this year's WWDC session What's new in Create ML

Recently I refactored some UIKit code into SwiftUI and found that the ViewController could be relatively easily transformed into an ObservableObject class, changing @IBOutlets into @Published properties.

I've been experimenting with using AVCaptureVideoPreviewLayer too. Using a preview layer is likely to be a very CPU / memory efficient way to do it. I don't yet have a clear picture of the efficiency of using SwiftUI views to run the preview directly in the way I proposed, so while it can be a lot of fun to do it purely in SwiftUI it may not be the best choice just yet.

Question page
I'm new to ML. I would like to implement some sort of color matching with two photos (i.e. when superimposing a person on a different background, adjusting color, contrast, etc. to match the background better). Is something like that suited for Core ML (and if so, do you have any suggestions on how to approach that?), or would a simple algorithm be a better solution for those kinds of tasks?

Doesn't sound like something you can do with Core ML or Create ML. Try the Vision API Q/A on Thursday from 2-3 PM PT.

Question page
Would Create ML/Components/Core ML be capable of dealing with music data? Specifically I want to train a model that can predict the tempo (BPM - beats per minute) of a song.

Core ML models can support audio data through MultiArray inputs of audio samples. Create ML does support audio/sound classification but not tempo estimation.

Doing a quick search online it seems there is some work on models for tempo estimation, many of which could be converted to Core ML

Question page
How can I ball park the runtime for linear regressor predictions? What I really want to know, how often I can call this?

I love this question, and we have a great session for you which dropped today.

Optimize your Core ML usage

Question page
Why do older devices use more memory when loading MLModel? A model used about 600MB of memory on the iPhone 12 Pro, but on the iPhone 11 Pro the app crashed over 1.2GB of memory.

We need to look at the model in question. Please submit a problem report through Feedback Assistant.

We are not aware of such memory usage issue in general between the hardwares.

Question page
We're facing some challenges on converting our AI models to Core ML, some operations 'state-of-the-art' aren't fully supported and we're considering running it in a different approach. Would it be feasible to have a model in C++ and leverage the GPU power of the devices, if yes... how? Is there any workaround for torch.stft and torch.istft?

A composite operator may help you convert these operations:

https://coremltools.readme.io/docs/composite-operators

In some cases, you can also supply a custom operator but to leverage the full Core ML stack its best to see if you can represent the functionality as a composite op first

Are the FFT operations in the beginning / pre processing stage of the model? If so, then you can accelerate the rest of the model by converting it to Core ML, and implementing the FFT operation using BNNS or Metal.

In any case, submitting a feedback request with your use case would be great.

Question page
Can the number of features vary? For example, say I want to predict a cat's age given the number of whiskers, fur density, and number of legs. However, sometimes I may only have number of whiskers and number of legs, but not fur density. Would that require its own, separately trained, MLModel?

This is such a cool example! Thanks for the question. Have you tried using the tabular classifiers in Create ML? When you have missing data in your feature columns you can try replacing them (imputing). The TabularData framework makes this part really easy.

Question page
I want to start learning ML and Core ML, and I have thought of a problem space that may be interesting that could use further exploration. I know NLP depends on extraordinarily large data sets, but I'm wondering about the utility of training a model on a constructed language with a much smaller data set. The one I have in mind has a very small official dictionary (slightly more than 100 official words), and rather simple grammar rules. Are there resources you would recommend for exploring this specific application of ML, or any pitfalls I might want to keep in mind?

This depends somewhat on what sort of tasks and models you are interested in.

For example, for a classification task, the maxent classifier available through Create ML is not language-dependent. It should be able to take on classification tasks in an artificial language of this sort. Gazetteers are language-independent, so they would still be usable.

Our built-in embeddings are language-dependent, so they would not be of use here.

If you want to train your own embedding or language model using open-source tools, that probably would still require significant amounts of data, but perhaps not as much as with natural languages.

Language modeling techniques have recently been applied with some success to programming languages. If your rules are similar to the syntax rules of programming languages, you might consider using the sorts of parsing tools that are used for them, but that is really a different area than NLP.

Question page
We converted a Tensorflow image segmentation model to Core ML. We notice that we get different results when running this Core ML model on macOS (with Python3 and coremltools) versus on iOS. Predictions are way less accurate on iOS and we cannot explain why (even when setting computeUnits parameter to .cpuOnly).

Have you tried setting

compute_precision=ct.precision.FLOAT32

as described here?

That said, for the CPU compute unit I would expect the predictions to match between iOS and macOS. There could be differences with the “all” compute unit depending on the actual hardware the model runs on Mac and iOS which can be different. If you could file a feedback request with the code to reproduce these differences that you are observing that would be great!

Question page
Do you know if it is possible to have a layer-wise execution time profiling with XCode 14 for the operations that run on the Neural Engine or GPU?

The new performance reports give a layer-wise break down of compute unit support, but not execution time. Please consider filing some feedback at:

http://feedbackassistant.apple.com

Question page
Core ML Performance Report is great, but I can't find per-layer performance stats to find bottlenecks in our model.

The new performance reports offer per-layer compute unit support, but not per-layer timings. One further step you can take to find bottlenecks in the model is to press the "Open in Instruments" button where you can see further details in the Core ML Instrument. This won't offer per layer timing details, but it can help find bottlenecks related to data operations and compute unit changes.

Question page
Is there a way to run performance report on older versions of iOS? I suspect new compiler runs model differently than the Xcode 13 one.

Performance reports require iOS 16. Older iOS version unfortunately cannot provide the same information.

Question page
There are a lot of CoreMLCompiler versions throughout the Xcode history. Some break inference (e.g. some of the Xcode 13 coremlcompilers broke iOS 14 runtime). Is there a way to diagnose these errors without compiling under every compiler and running on all iOS devices?) And is there a known stable version of a compiler?

Maintaining backward compatibility is really important for us. We would like to understand this better. Could you file a bug report on:

feedbackassistant.apple.com?

If it works for you, could you please set up a 1:1 lab with Apple engineers so we can understand the issue better?

Question page
When running model.predict(), I get "Error in declaring network." What does that mean?

This indicates that there is an error setting up the model. Can you please file a feedback report with code to reproduce the issue? Also can you verify if this issue reproduces with different compute unit options used when loading the model?

Question page
Is there a way to use one IOSurface for both ANE and GPU work? Or access ANE IOSurface directly, and map it to MTLTexture by hand?

IOSurface-backed CVPixelBuffer with OneComponent16Half pixel format type can be shared with Neural Engine without copy. Likewise, MLMultiArray which is backed by the pixel buffer can also be shared without copy. (See MLMultiArray.init(pixelBuffer:shape:)).

For input features, using these objects in MLFeatureValue is enough to take advantage of the efficient data processing. When the output feature type matches the type above, Core ML automatically uses these objects in the output feature values as well.

For output features, you can even request the Neural Engine to write into your own buffer directly. See MLPredictionOptions.outputBackings.

You can create MTLTexture view into the pixel buffer with CVMetalTextureCache.
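
A minimal sketch of setting that up, with illustrative sizes (the 512x512 dimensions are assumptions):

import CoreML
import CoreVideo

let width = 512
let height = 512

// Create an IOSurface-backed, Float16 pixel buffer...
var pixelBuffer: CVPixelBuffer?
let ioSurfaceProperties: [String: Any] = [:]
let attributes: [String: Any] = [kCVPixelBufferIOSurfacePropertiesKey as String: ioSurfaceProperties]
CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                    kCVPixelFormatType_OneComponent16Half,
                    attributes as CFDictionary,
                    &pixelBuffer)

// ...and wrap it in an MLMultiArray without copying. The shape's trailing
// dimensions must match the buffer's height and width.
let multiArray = MLMultiArray(pixelBuffer: pixelBuffer!,
                              shape: [1, NSNumber(value: height), NSNumber(value: width)])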

Question page
Is there any chance you will add support for at least 4 channels no-copy buffers?

One way is to stack each channel on the height axis and use that big pixel buffer as the backing of an MLMultiArray. So:

// size = (width, height * 4)
// format = kCVPixelFormatType_OneComponent16Half
let backingPixelBuffer = ...
let multiArray = MLMultiArray(pixelBuffer: backingPixelBuffer, 
                              shape: [4, height, width])

This won’t work if your image representation is a so-called “packed” format, where the channels are interleaved in the frame buffer. We would appreciate a Feedback Assistant report with a use case if that’s what you are looking for.

Question page
Are float16 MLMultiArray no-copy also? Will there be a copy event if I specify user-allocated MLMultiArray with float16 data into outputBackings?

The Neural Engine is unable to write into MLMultiArray directly unless it was initialized with an IOSurface-backed pixel buffer.

MLMultiArray.init(pixelBuffer:shape:)

Question page
Is direct writing to IOSurface-backed buffers compatible with flexible shape input/outputs?

As the application needs to prepare the output buffer in outputBackings property, it must know the shape of the output before invoking the inference. Some flexible shape models (e.g. enumerated input shapes) meet the criteria, but so called “dynamic shape”, where the output shape is dynamically determined by the network, won’t work.

Question page
Is there any guidance on using Create ML for creating segmentation models (great for use in ARKit to segment unique types of objects as detected in the camera feed)? Or would this be a case for building a custom Turi/other model and converting to Core ML?

The first thing I would try is the DeepLabV3 model! Which can be found here:

https://developer.apple.com/machine-learning/models

You can also convert your custom model using coremltools

Question page
Do you have any code snippets showing how to load a stereo audio file into MLMultiArray object?

It depends on the desired buffer layout of PCM data (interleaved or separate, int16 or float32, etc). A good starting point can be:

If you already have a buffer with the desired layout, you can use MLMultiArray.init(dataPointer:shape:dataType:strides:deallocator:) or MLShapedArray.init(bytesNoCopy:shape:strides:deallocator:). If you need to copy with some data type conversion (e.g. short to float32), MLShapedArray.init(unsafeUninitializedShape:initializingWith:) would work.

AAC audio, on the other hand, is often decoded to a sequence of audio chunks which is not aligned to one second boundary. So, we would need to do some (tedious) buffer munching. The following code loads an AAC file into MLShapedArray for every one second and write each back to a new AAC file.

MLShapedArray is a Swift-y cousin of MLMultiArray and, if you are using Swift, it is preferred over MLMultiArray. Core ML accepts either type. MLMultiArray(_ shapedArray:) and MLShapedArray(_ multiArray:) can convert between them.

import AVFoundation
import CoreML

let audioFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, 
                                sampleRate: 44100, 
                                channels: 2, 
                                interleaved: false)!

let frameCount = AVAudioFrameCount(audioFormat.sampleRate)

let inputURL = URL(filePath: "/Users/apple/sample.aac")
let sourceAudioFile = try! AVAudioFile(forReading: inputURL)
let sourceAudioBuffer = AVAudioPCMBuffer(
                            pcmFormat: audioFormat, 
                            frameCapacity: frameCount
                        )!

let aacSettings = [AVFormatIDKey : kAudioFormatMPEG4AAC,
                   AVSampleRateKey : 44100,
                   AVNumberOfChannelsKey : 2]

let outputURL = URL(filePath: "/Users/apple/output.aac")
let outputAudioFile = try! AVAudioFile(forWriting: outputURL, 
                                       settings: aacSettings)

// Loop to read and decode source audio file.
while sourceAudioFile.framePosition < sourceAudioFile.length {
    try! sourceAudioFile.read(into: sourceAudioBuffer)
    let frameLength = Int(sourceAudioBuffer.frameLength)

    // Make MLShapedArray from the audio buffer.
    let leftChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![0], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let rightChannels = MLShapedArray<Float32>(
        bytesNoCopy: sourceAudioBuffer.floatChannelData![1], 
        shape: [1, frameLength], 
        strides: [frameLength, 1], 
        deallocator: .none
    )
    let audioShapedArr = MLShapedArray(
        concatenating: [leftChannels, rightChannels], 
        alongAxis: 0
    )

    // Write the MLShapedArray back to a audio buffer.
    let outputAudioBuffer = AVAudioPCMBuffer(
        pcmFormat: audioFormat, 
        frameCapacity: sourceAudioBuffer.frameLength
    )!
    audioShapedArr[0].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![0].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    audioShapedArr[1].withUnsafeShapedBufferPointer { ptr, _, _ in
        outputAudioBuffer.floatChannelData![1].initialize(
            from: ptr.baseAddress!, 
            count: frameLength
        )
    }
    outputAudioBuffer.frameLength = sourceAudioBuffer.frameLength

    // And encode and write to an AAC file.
    try! outputAudioFile.write(from: outputAudioBuffer)
}
Question page
What image-sharpening torch model did you use in the "Optimize your Core ML usage" talk?

For the image sharpening model, we started with this Super Resolution PyTorch model and then made some custom modifications to fit our use case before converting to Core ML format using coremltools.

Question page
Are there any limitations on number of IOSurface-backed buffers in a model?

No, Core ML doesn’t put any limitations.

Question page
Is it possible to have flexible shape (enumerated) inputs (and therefore outputs) to be compatible with outputBackings and IOSurface-backed MultiArray?

Yes this will work as long as the output backing buffer is the correct size corresponding to the size of the input.

One note is that being able to avoid data copies during inference for models with flexible shapes will vary depending on circumstances. You can use the Core ML Instrument and look in the Data lane to see if data copies are occurring.

Question page
Does Core ML have everything necessary to perform keyword extraction? How would you go about extracting keywords from articles of text?

Natural Language has a number of tools that can be useful in keyword extraction: tokenization, part-of-speech tagging, named entity recognition, gazetteers that could be used to identify stop words, and so on.

We don't provide an implementation of a specific keyword or keyphrase extraction algorithm, but there are algorithms that are sometimes used that take into account features such as frequency, co-occurrence statistics, TF-IDF, etc. that can be calculated from text that has been tokenized and processed using some of these tools.

Doing this fully unsupervised is a difficult task, though. You might be able to do better if you have some advance knowledge of the vocabulary that is relevant to the sort of text you will be working with.
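
As a rough illustration, here is a minimal frequency-based sketch using NLTagger from Natural Language: it counts the nouns in a piece of text and surfaces the most frequent ones as candidate keywords. This is only one possible starting point, not a complete keyphrase extraction algorithm.

import NaturalLanguage

// Count nouns and return the most frequent ones as candidate keywords.
func candidateKeywords(in text: String, limit: Int = 10) -> [String] {
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text

    var counts: [String: Int] = [:]
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lexicalClass,
                         options: [.omitPunctuation, .omitWhitespace]) { tag, range in
        if tag == .noun {
            counts[text[range].lowercased(), default: 0] += 1
        }
        return true
    }
    return counts.sorted { $0.value > $1.value }.prefix(limit).map { $0.key }
}

From there you could filter with a stop-word gazetteer or weight by TF-IDF, as mentioned above.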

Question page
How do I handle situations where older ANE versions might not support certain layers and it will result in cpuAndNeuralEngine config being extremely slower on some devices?

MLComputeUnits.all is the default option and we recommend using that in most of the cases.

Core ML tries to optimize for latency while utilizing all the available compute units. MLComputeUnits.cpuAndNeuralEngine is helpful when your app is using the GPU for pre- or post-processing and would like Core ML to not dispatch the model on the GPU. Other than that, MLComputeUnits.cpuAndNeuralEngine behaves very similarly to MLComputeUnits.all.
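
For reference, the compute units are chosen when the model is loaded. A small sketch (MyModel is a hypothetical stand-in for your Xcode-generated model class):

import CoreML

let configuration = MLModelConfiguration()
configuration.computeUnits = .all                  // default: CPU, GPU, and Neural Engine
// configuration.computeUnits = .cpuAndNeuralEngine // keep the GPU free for your own work

// `MyModel` is hypothetical; substitute your generated model class.
let model = try MyModel(configuration: configuration)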

If you have a model that is running much slower on certain devices, we recommend filing some feedback at http://feedbackassistant.apple.com with the model and specific device(s) included.

Question page
What is the difference between MLTrainingSessionParameters and MLObjectDetector.ModelParameters?

MLTrainingSessionParameters is for the async training API, e.g., .train(), to specify training-related parameters, such as the checkpoint saving location, whereas MLObjectDetector.ModelParameters is for both sync and async training to specify model-specific parameters.
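
As a hedged sketch of how the two fit together (the argument values are placeholders and the exact initializer labels may differ slightly; please check the CreateML documentation):

import CreateML

// Session-level options: where checkpoints are saved and how often to
// checkpoint and report progress.
let sessionParameters = MLTrainingSessionParameters(
    sessionDirectory: URL(fileURLWithPath: "/tmp/ObjectDetectorSession"),
    reportInterval: 10,
    checkpointInterval: 100,
    iterations: 1000
)
// Model-specific options (e.g., maximum iterations) go in
// MLObjectDetector.ModelParameters. Both are then handed to the async
// training API, e.g., MLObjectDetector.train(trainingData:parameters:sessionParameters:).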

Question page
I'm part of a team that is building an app that is wishing to identify and recognise faces in a collection of photos. At the moment I've had success with Photos/Vision framework to find faces in photos and isolate them, but we're currently then sending those faces to AWS Amazon Rekognition service to help compare the face to a set of others and associate them to an existing face, or create a new face model. If I wanted to move this type of modeling onto the device itself (rather going through a network request to a 3rd party service), could you possibly guide me where to start? I'm assuming I could do the same thing locally on device using Apple frameworks?

We do not offer on-device face recognition solutions.

Generally speaking you would need to either find (or train, if you have the data and the know-how) a face recognition model, which could then be run on-device through Core ML once converted into that format. Often such models return some descriptor, which can be compared to other similar descriptors to provide a distance. How best to measure that distance is often tied in to how the face recognition model was trained.

You may file a Feedback Assistant request if you'd like Apple to offer face recognition in the future.

Question page
Is it possible to have player (end-user) enabled Machine Learning? For example in my game Follow the White Rabbit it would be helpful to adjust the model. For example supporting different hand sizes, skin tones, as well as support hands that had more/less than the standard number of fingers.

Yes, you can adapt a model on-device using any one of our ML frameworks, including Core ML, Create ML Components, MPSGraph, and BNNS. The approach you take depends on the data and problem you are working with.

To detect hand poses, I recommend checking out the sample code project Detecting Hand Poses with Vision.

If you foresee training on a small dataset, then it might be worth looking into using the KNN algorithm available in Core ML, check out the sample code project Personalizing a Model with On-Device Updates to learn more.

Finally, it is worth browsing through the documentation for the newly released API, Create ML Components.

Question page
Is it possible to convert a PyTorch Text->Image model, such a VQGAN, to CoreML?

You can try using coremltools - https://coremltools.readme.io/docs/pytorch-conversion

I was trying to find something more specific for you, but couldn’t. Personally I haven’t tried VQGAN, but it seems like CLIP model can be converted. Here’s an issue that has been resolved regarding CLIP.

Question page
Does coremltools support conversion of PyTorch Text-> Image models like CLIP? VQGAN?

Both these models (CLIP and VQGAN) are based on CNNs and transformer architectures, both of which should be supported.

In fact here is a resolved issue of a CLIP model conversion.

Note that, depending on the details, you may have to perform the pre-processing that transforms the text input into a tensor representation outside the PyTorch model given to the Core ML Tools convert API. The conversion operates on PyTorch models with a tensor-in, tensor-out interface.

I’d say just give the converter a try, and please take a look at some of the examples on the doc page; if you run into issues, post on the GitHub repo.

Question page
Is it possible to dispatch a Core ML inference evaluation as part of a display or compute shader pipeline. Or do I need to wait for the CPU to be informed the frame has been rendered before dispatching from the CPU. Best of all would be if it could run on the ANE so that the GPU is free to work on the next frame.

If the output of the GPU is in an IOSurface (in Float16 format), you can feed that to Core ML and let the ANE work on it directly without any copies, but the CPU does get triggered today to synchronize these two computations.

Core ML doesn’t support MTLSharedEvent, if that’s what’s implied here.

Would you be able to file a feature request on feedbackassistant.apple.com with a little more details about your use case and may be some sample code on how you want to accomplish this? That would really help us push the API in the right direction.

Question page
Hi, I'm excited to see more information about optimizing recent models for Core ML including the `ane_transformers` repo. If I wanted to optimize eg CLIP for ANE, should I use code from that repo, or just try to take recommendations from the case study?

Yes I think using the code from the Apple Neural Engine (ANE) Transformers repo at the end of the Deploying Transformers on the Apple Neural Engine article is the best way to get started, and definitely follow along the recommendations of the article as well. Sounds like you are already on the right track!

The default conversion should be quite efficient as well for the neural engine (NE). With the new performance tab in Xcode 14, you will see whether the model is already neural engine resident or not.

There are details in the article on some of the changes specific to distilbert, which may or may not be required for the transformer architecture in CLIP.

There are more details to be found in the A Multi-Task Neural Architecture for On-Device Scene Analysis article as well, hope this helps.

In any case if you find any inefficiencies after conversion, feel free to share with us via a feedback request. We are constantly adding new converter and NE compiler optimizations to automatically detect patterns and map them efficiently to NE, so such feedback is very valuable!

Question page
In the "Optimize your Core ML usage" session, the presenter, Ben, explains that he got a latency of 22ms using the new performance metrics and that gives him a running frame rate of 45 frames per second. How did he come to that conclusion and how can I look at my performance metrics to determine our frames per second as well?

The number is just an upper bound estimate based on:

1000 ms / 22 ms ≈ 45 predictions per second

Such estimates often help us to understand the amount of headroom we can use for other operations while meeting the real time requirement (30 fps, etc).

Question page
Is it possible to run a Core ML model in the cloud/on Linux? We are using a Core ML model to power privacy-preserving, on-device features. But we want to offer a web-based demo to potential users, since downloading an app can be higher friction than just using a website.

No, it is not supported. It is an interesting use case. A feedback assistant report will be much appreciated!

Question page
Do you have an example of how the ML image style transfer was created from an earlier session?

You can learn about how Create ML can help you build style transfer models with an example integration in this session:

Build Image and Video Style Transfer models in Create ML

There are also a wide variety of style transfer models online that can be converted to Core ML format with coremltools.

The model takes in an image and outputs an image. I believe the app is streaming data from an AVCaptureSession and running each frame through the Core ML model and outputting the result properly scaled back to the original image size.

Question page
Does Core ML benefit from two ANE's in M1 Ultra?

Yes, when you are using multiple models or batch inference.

Question page
Core ML Tools
We're facing some challenges on converting our AI models to Core ML, some operations 'state-of-the-art' aren't fully supported and we're considering running it in a different approach. Would it be feasible to have a model in C++ and leverage the GPU power of the devices, if yes... how? Is there any workaround for torch.stft and torch.istft?

A composite operator may help you convert these operations:

https://coremltools.readme.io/docs/composite-operators

In some cases, you can also supply a custom operator but to leverage the full Core ML stack its best to see if you can represent the functionality as a composite op first

Are the FFT operations in the beginning / pre processing stage of the model? If so, then you can accelerate the rest of the model by converting it to Core ML, and implementing the FFT operation using BNNS or Metal.

In any case, submitting a feedback request with your use case would be great.

Question page
Are there any updates planned for coremltools, notably supporting complex numbers. I'm looking to convert a TensorFlow model, but even after implementing missing operations like Fast Fourier Transform, I'm blocked by the fact there is no complex number support.

Yes, you are right. There isn’t a complex number data type in the CoreML MIL spec.

The best way to use them is by treating complex numbers as 2D vectors of real and imaginary numbers.

Question page
We converted a Tensorflow image segmentation model to Core ML. We notice that we get different results when running this Core ML model on macOS (with Python3 and coremltools) versus on iOS. Predictions are way less accurate on iOS and we cannot explain why (even when setting computeUnits parameter to .cpuOnly).

Have you tried setting

compute_precision=ct.precision.FLOAT32

as described here?

That said, for the CPU compute unit I would expect the predictions to match between iOS and macOS. There could be differences with the “all” compute unit depending on the actual hardware the model runs on Mac and iOS which can be different. If you could file a feedback request with the code to reproduce these differences that you are observing that would be great!

Question page
Is it possible to convert a PyTorch Text->Image model, such a VQGAN, to CoreML?

You can try using coremltools - https://coremltools.readme.io/docs/pytorch-conversion

I was trying to find something more specific for you, but couldn’t. Personally I haven’t tried VQGAN, but it seems like CLIP model can be converted. Here’s an issue that has been resolved regarding CLIP.

Question page
Does coremltools support conversion of PyTorch Text-> Image models like CLIP? VQGAN?

Both these models (CLIP and VQGAN) are based on CNNs and transformer architectures, both of which should be supported.

In fact here is a resolved issue of a CLIP model conversion.

Note that, depending on the details, you may have to perform the pre-processing that transforms the text input into a tensor representation outside the PyTorch model given to the Core ML Tools convert API. The conversion operates on PyTorch models with a tensor-in, tensor-out interface.

I’d say just give the converter a try, and please take a look at some of the examples on the doc page; if you run into issues, post on the GitHub repo.

Question page
Do you have an example of how the ML image style transfer was created from an earlier session?

You can learn about how Create ML can help you build style transfer models with an example integration in this session:

Build Image and Video Style Transfer models in Create ML

There are also a wide variety of style transfer models online that can be converted to Core ML format with coremltools.

The model takes in an image and outputs an image. I believe the app is streaming data from an AVCaptureSession and running each frame through the Core ML model and outputting the result properly scaled back to the original image size.

Question page
Create ML
When training an Object Detection Model in Create ML, is there a way to take automatic snapshots to ensure that there is a record of how the model is performing throughout iterations as opposed to having to take manual snapshots to do so?

In the Create ML app this is a manual process. You can also set options on the project to automatically snapshot on pause or train more. Having an automatic option is a great feature request. Please consider filing a request using feedback assistant.

You do have the option to use the Create ML framework directly in Swift and set your own cadence of checkpointing.

MLObjectDetector

Question page
When training a video action classifier model in Create ML, is it best to have only one person's poses in the frame (and crop out others)?

Yes.

If you have multiple people in the screen. Try to keep other people consistently smaller than the main person. Then it will still work (automatically select the maximum bounding box person)

Check out the video from WWDC 2020:

Build an Action Classifier with Create ML (at 24m21s)

When it comes to using the model in your applications, make sure to only select a single person. Your app may remind users to keep only one person in view when multiple people are detected, or you can implement your own selection logic to choose a person based on their size or location within the frame, and this can be achieved by using the coordinates from pose landmarks.

Question page
In the "What's new in Create ML" talk, near the end Repetition Counting was mentioned, and a reference to the "linked article and sample code", yet the WWDC22 Sample Code list does not include this, nor does the documentation, I believe. Can you point me to the Sample Repetition Count code and Documentation?
I‘m trying to get into Machine learning. What’s the best way to get to know all the methods for integrating CreateML into an app?

That's exciting, glad you're interested! I can definitely recommend this year's (and previous years') sessions on CreateML. Each helps go over part of CreateML and might be useful to you.

Here are some videos that can help you start from scratch. As others suggested there are a lot more things added to Create ML in the last couple years.

Choosing the right task and understanding the data needs for training are a great place to start.

Question page
I would like to build an app to predict stock price trends and classify expenses in a bookkeeping / tax management app. How might I get started with Create ML?

So for something like that, you might look into Tabular Regressors for predicting future prices based on past price data (and whatever other data you want to use), and Tabular Classifiers to classify expenses into categories you select.

You might be particularly interested in training on-device using the Create ML framework for this... given the historical data would be highly personal and continually changing (if I'm understanding the use case). The Tabular Regressor example that Alejandro shows in the Get to know Create ML Components session is almost spot on with the problem you're trying to solve.

If the textual input is more varied and similar to human language then you might use a Text Classifier. If it's more like terms and you want to include additional context such as price (numbers), then the Tabular Regressor is more suitable.

Question page
Create ML Components looks amazing!! Have you experimented recreating popular architectures, like GANs and Autoencoders, with Create ML components?

I'm glad you like it. Components allow you to work with a lot of different architectures.

Create ML Components does not support building or training neural networks. It's closer to something like scikit-learn

Question page
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
Can a GAN solution be made from ML Components?

If you already have a GAN model, then you can use it in Create ML Components using an adaptor.

Create ML Components does not offer out-of-box support for any specific GAN model, but you are able to make a transformer to use a GAN under the hood.

If you don't already have a model, I’d suggest starting by checking the models you are interested in (where they are published) and seeing if they can be converted to MLModels, if you want to perform inference (deploy) on Apple hardware. Create ML and Create ML Components do not offer out-of-the-box support for such models; they are meant to give you the flexibility to do so.

Question page
When will Create ML support Neural Networks?

Just to clarify, it does support neural networks. For instance FullyConnectedNetworkClassifier.

But if you wanted to create a custom network you would need to use Metal or Accelerate.

Question page
I've created a small image classification model using Create ML with around 350 image labels. However, for the iOS app I'm making that could scale to over 100,000 labels (and likely much more) - each with over a hundred images for training/testing. Is there any way to scale to that level using Create ML? I've started teaching myself TensorFlow and researching various cloud services like Google Colab for the training because I think I'll need to go that route... and then convert that to Core ML. I'd appreciate any thoughts / recommendations.

Wow, 100,000 classes! That’s a hard problem – in fact, it’s something which would be at the research cutting edge. Is there any structure to your classes? That might be something you could exploit.

It’s definitely worth having a look at the literature here — these tend to be called “extreme classification tasks”

For instance http://manikvarma.org/pubs/bengio19.pdf

This is a review which covers some of the problems in this area. However, if you have a natural hierarchy to your labels, you might consider having a hierarchy of classifiers. Let’s say we’re talking about animals; you might have a classifier from “animal” to “bird”, “mammal”, “reptile”, etc etc etc, then from “bird” to bird species

That way each classifier is only predicting among, say, 1,000 classes – which is a more tractable problem

If there is no class hierarchy, and you're looking to recognize individual identifiable objects, you may want to check out algorithms behind similar image search, where you calculate an embedding for an image, then find N nearest embedding from the known images, and derive your class/object from them.

Question page
Is there the ability in Create ML to make text generation models, GPT-3 type applications? Haven’t seen it but wanted to double check

No. Create ML and Create ML Components are meant to allow you to create custom ML models fitted to your training data. If you want to use such a model, it makes sense to convert it to Core ML and try it out.

Question page
Is there a list of transformers and estimators available for use?

We have many transformers and estimators. A good place to get the list is the developer documentation.

https://developer.apple.com/documentation/createmlcomponents

Question page
Can you use any iOS graphics API for data augmentation?

Yes, you are free to use any API. The only requirement is that it produces a CIImage.

Question page
Can someone clarify the difference between a classifier and regressor again for me?

A classifier is used to predict a discrete categorical value. Think of it as predicting an enum.

A regressor is used to predict a real value. Think of it as predicting a float or double.

In the first demo in Get to know Create ML Components a regressor was used to predict a ripeness value.

A classification approach to the same problem would require you to define categories like “green”, “not ripe”, “ripe”, “over ripe”. This is an option as well, but you would not be able to compare the ripeness of two examples that got classified into the same category.

How about run time efficiency between the above mentioned classifier and regressor?

They should have similar prediction compute time if you are using a common feature extractor. For image models that feature extraction step will likely dominate inference compute. LogisticRegressionClassifier and LinearRegressor will both be doing a similar size matrix multiplication behind the scenes, particularly if you restrict yourself to a few classes. As the number of classes increases, the classifier will become slower.

Question page
How can I fix a “can’t find MLLinearRegressor in scope” error?

MLLinearRegressor is a symbol from CreateML, do:

import CreateML

instead of:

import CoreML

The quickest way is to try that in an Xcode Playground

Note: CreateML is NOT available on an iOS simulator, so when you build an app for iOS, please target a physical iOS device if you want to build & run. If you just want to build without worrying about a physical iOS device, choose any iOS device to build against.

If you want to try with a Playground, please use macOS Playground, since iOS Playground uses iOS simulator as well.

Question page
Would Create ML/Components/Core ML be capable of dealing with music data? Specifically I want to train a model that can predict the tempo (BPM - beats per minute) of a song.

Core ML models can support audio data through MultiArray inputs of audio samples. Create ML does support audio/sound classification but not tempo estimation.

Doing a quick search online, it seems there is some work on models for tempo estimation, many of which could be converted to Core ML.

Question page
Hi, can an ML model extract certain values of a JSON, such as “VideoType” and URLs, and then return those values to make a network request? I’m looking at making a video recommendation system with ML but not sure of the best way to do it.

Have you tried the MLRecommender in Create ML? That is a great place to start for building recommender systems. Since your data is in json format, you can also try the TabularData framework which can help with the json loading!

Question page
Can the number of features vary? For example, say I want to predict a cat's age given the number of whiskers, fur density, and number of legs. However, sometimes I may only have number of whiskers and number of legs, but not fur density. Would that require its own, separately trained, MLModel?

This is such a cool example! Thanks for the question. Have you tried using the tabular classifiers in Create ML? When you have missing data in your feature columns you can try replacing them (imputing). The TabularData framework makes this part really easy.

Question page
What would be the best way to figure out which objects go together - say you have 10 groups of three and a pool of 100 ungrouped objects & you want to group them similarly?

Thanks for asking! It would be helpful to clarify what the "objects" are in this context. If they are objects within image data, you could leverage the Create ML feature extractor to turn it into structured tabular data.

From there, it's a classical unsupervised clustering problem, for which there are several approaches. Something like k-means is a quick and effective approach that might work in your case.

https://apple.github.io/turicreate/docs/userguide/clustering/kmeans.html

There is also a CIKMeans filter in Core Image. That may be exactly what you need.

Question page
I was surprised that the session video did not discuss (unless I missed it, which is definitely possible) how to write custom components. Let's say I want to write my own PoseSelector component. For example to pick the person closest to the center of the frame, and to keep the selection consistent across frames in a video. Can I? And if yes, how?

Hello! Thanks for your question. You can definitely build your own custom components. In the session Get to know Create ML Components, Alejandro provides an example of how to do it by building a saliency transformer.

All you need to do is to conform to the Transformer protocol!

The only required method for you to implement is applied:

https://developer.apple.com/documentation/createmlcomponents/transformer/applied(to:eventhandler:)-38h86

Search for Required on the page
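
For illustration, here is a hedged sketch of a pose-selector-style component. To keep it self-contained, the input and output are simplified to plain CGRect bounding boxes in normalized coordinates; the actual pose types in Create ML Components differ, so treat this as the shape of a custom transformer rather than a drop-in implementation.

import CoreGraphics
import CreateMLComponents

// Picks the detection whose bounding box is closest to the frame center.
struct CenterMostSelector: Transformer {
    func applied(to input: [CGRect], eventHandler: EventHandler? = nil) async throws -> CGRect? {
        let center = CGPoint(x: 0.5, y: 0.5)   // normalized coordinates assumed
        return input.min {
            distance(from: $0, to: center) < distance(from: $1, to: center)
        }
    }

    private func distance(from rect: CGRect, to point: CGPoint) -> CGFloat {
        let dx = rect.midX - point.x
        let dy = rect.midY - point.y
        return (dx * dx + dy * dy).squareRoot()
    }
}

Keeping the selection consistent across frames would mean carrying state (for example, the last selected box) and preferring detections close to it; that logic would live inside the same applied method.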

Question page
How do I go about training a dance classifier with with video files? Are there any components for audio, video to get started with?

You can definitely build a dance classifier using the action classifier in Create ML. See the following session from WWDC 2020:

Build an Action Classifier with Create ML

Question page
With the initializers for MLImageClassifier.ModelParameters being deprecated, what is the easiest way of increasing the iterations being performed?

MLImageClassifier.ModelParameters still has the

init(validation: ValidationData,
     maxIterations: Int,
     augmentation: ImageAugmentationOptions, 
     algorithm: ModelAlgorithmType)

initializer, which you can use to set the maxIterations along with other parameters that you want.

There are a few other initializers that were deprecated so if you are trying to use one of those old ones to set the maxIterations, you will get this warning.

Note that this particular initializer has no default value for augmentation. So, when you call this initializer you need to pass the augmentation parameter as well. If you don't want to set any augmentation, you can just pass an empty set of options ([]) for this parameter. If you do not specify augmentation in the init at all, one of the old initializers will be used and hence you will get the deprecation warning.
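
For example, a hedged sketch (only the four parameter labels come from the initializer above; the validation strategy and algorithm values are illustrative assumptions, so verify the exact enum cases in the CreateML documentation):

import CreateML

let parameters = MLImageClassifier.ModelParameters(
    validation: .split(strategy: .automatic),   // assumed validation option
    maxIterations: 50,
    augmentation: [],                           // no augmentation
    algorithm: .transferLearning(               // assumed algorithm case
        featureExtractor: .scenePrint(revision: 1),
        classifier: .logisticRegressor
    )
)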

Question page
As far as I understand, extracting text from images is not possible for Arabic language, would it be possible to use Create ML to achieve the same effect that is built in to extract Arabic text from images and documents?

You are right about Arabic support. While Apple announced more language support for Live Text this year, Arabic was not one of them. The complete list is here:

https://www.apple.com/ios/feature-availability/#live-text-live-text

It's not possible to extend Live Text to add additional user languages at this time.

To build your own system would require solving multiple ML problems, including locating text in the image, decomposing it into graphemes (characters), and to be robust it should probably include some sort of spelling/grammar layer to reduce transcription errors.

Building a complete solution like this is beyond what Create ML is designed for today.

Question page
Is there any guidance on using Create ML for creating segmentation models (great for use in ARKit to segment unique types of objects as detected in the camera feed)? Or would this be a case for building a custom Turi/other model and converting to Core ML?

The first thing I would try is the DeepLabV3 model! Which can be found here:

https://developer.apple.com/machine-learning/models

You can also convert your custom model using coremltools

Question page
In the "Get to know Create ML Components" session around the 14m mark, the presenter mentioned that the augmentation applied is only used during training, not validation. Is that really true, given that it was just applied using flatMap() to the combined dataset in the code shown? It is not what I would expect based on reading the code.

It applies for training and validation data, but only at training time, not when doing predictions.

Question page
Is it possible to train the model generated by MLRecommender on device when new data is available?

MLRecommender does not support on-device training/updating. However, I suggest you check our WWDC21 session below to build personalized recommendation-like experience into your app:

https://developer.apple.com/videos/play/wwdc2021/10037/

Question page
Great session on image colorization, Geppy. Do you have any examples of user-customizable hand tracking? Think magic spells.

Interestingly, Geppy is a super hero as well. You can watch him demonstrate his powers with hand pose and action classification at the end of this session from last year:

Classify hand poses and actions with Create ML

Question page
As far as I know, multi-label image classification is not possible with Create ML. Is it possible with Create ML Components to create a multi-label classifier?

One option is to implement your own custom estimator using a framework like MPSGraph. To simplify the task (and data required), you may want to explore training the classifier on the features produced by a feature extractor, such as ImageFeaturePrint.

Question page
Do you have an example of how the ML image style transfer was created from an earlier session?

You can learn about how Create ML can help you build style transfer models with an example integration in this session:

Build Image and Video Style Transfer models in Create ML

There are also a wide variety of style transfer models online that can be converted to Core ML format with coremltools.

The model takes in an image and outputs an image. I believe the app is streaming data from an AVCaptureSession and running each frame through the Core ML model and outputting the result properly scaled back to the original image size.

Question page
Create ML Components
Create ML Components looks amazing!! Have you experimented recreating popular architectures, like GANs and Autoencoders, with Create ML components?

I'm glad you like it. Components allow you to work with a lot of different architectures.

Create ML Components does not support building or training neural networks. It's closer to something like scikit-learn

Question page
Can a GAN solution be made from ML Components?

If you already have a GAN model, then you can use it in Create ML Components using an adaptor.

Create ML Components does not offer out-of-box support for any specific GAN model, but you are able to make a transformer to use a GAN under the hood.

If you don't already have a model, I’d suggest starting by checking the models you are interested in (where they are published) and seeing if they can be converted to MLModels, if you want to perform inference (deploy) on Apple hardware. Create ML and Create ML Components do not offer out-of-the-box support for such models; they are meant to give you the flexibility to do so.

Question page
Is there a list of transformers and estimators available for use?

We have many transformers and estimators. A good place to get the list is the developer documentation.

https://developer.apple.com/documentation/createmlcomponents

Question page
Can you use any iOS graphics API for data augmentation?

Yes, you are free to use any API. The only requirement is that it produces a CIImage.

Question page
Can someone clarify the difference between a classifier and regressor again for me?

A classifier is used to predict a discrete categorical value. Think of it as predicting an enum.

A regressor is used to predict a real value. Think of it as predicting a float or double.

In the first demo in Get to know Create ML Components a regressor was used to predict a ripeness value.

A classification approach to the same problem would require you to define categories like “green”, “not ripe”, “ripe”, “over ripe”. This is an option as well, but you would not be able to compare the ripeness of two examples that got classified into the same category.

How about run time efficiency between the above mentioned classifier and regressor?

They should have similar prediction compute time if you are using a common feature extractor. For image models that feature extraction step will likely dominate inference compute. LogisticRegressionClassifier and LinearRegressor will both be doing a similar size matrix multiplication behind the scenes, particularly if you restrict yourself to a few classes. As the number of classes increases, the classifier will become slower.

Question page
I was surprised that the session video did not discuss (unless I missed it, which is definitely possible) how to write custom components. Let's say I want to write my own PoseSelector component. For example to pick the person closest to the center of the frame, and to keep the selection consistent across frames in a video. Can I? And if yes, how?

Hello! Thanks for your question. You can definitely build your own custom components. In the session Get to know Create ML Components, Alejandro provides an example of how to do it by building a saliency transformer.

All you need to do is to conform to the Transformer protocol!

The only required method for you to implement is applied:

https://developer.apple.com/documentation/createmlcomponents/transformer/applied(to:eventhandler:)-38h86

Search for Required on the page

Question page
In the "Get to know Create ML Components" session around the 14m mark, the presenter mentioned that the augmentation applied is only used during training, not validation. Is that really true, given that it was just applied using flatMap() to the combined dataset in the code shown? It is not what I would expect based on reading the code.

It applies for training and validation data, but only at training time, not when doing predictions.

Question page
Is it possible to have player (end-user) enabled Machine Learning? For example in my game Follow the White Rabbit it would be helpful to adjust the model. For example supporting different hand sizes, skin tones, as well as support hands that had more/less than the standard number of fingers.

Yes, you can adapt a model on-device using any one of our ML frameworks, including Core ML, Create ML Components, MPSGraph, and BNNS. The approach you take depends on the data and problem you are working with.

To detect hand poses, I recommend checking out the sample code project Detecting Hand Poses with Vision.

If you foresee training on a small dataset, then it might be worth looking into using the KNN algorithm available in Core ML, check out the sample code project Personalizing a Model with On-Device Updates to learn more.

Finally, it is worth browsing through the documentation for the newly released API, Create ML Components.

Question page
Is the sample code for human action repetition counting available?
I tried the sample code on 6th gen. iPad Mini doing 20 jumper jacks at VARYING paces. I found delays and missing counts. The first few jumper jacks were always not being counted. I’m guessing the hard coded stride 5 and length 90 used for sliding window transformer may be the culprit. To me there isn’t a correct set of numbers to use because I have no control on how fast or slow my users will do his or her jumper jacks. Please advise.

A few things you may further check:

  • frame rate - the sample app should print out the frame rate as debug information in the Xcode console. Please check if it is roughly 30fps. If not, try to improve the environment lighting, charge the device, etc. and see if it improves.
  • body pose - please check if a single person’s full body pose is in the middle of the screen, and while the person moves, check if the poses are accurate (e.g., no missing joints or joints jumping everywhere, no visible delays of pose tracking, etc.)
  • ignored joints in JointsSelector - the initial setting here has 5 joints ignored (for demonstration of this transformer purpose). You may remove them if you are interested in the full body and all joints.
  • stride - determines how often (in terms of frame count) the counter is refreshed, and it can be set to other numbers. The length 90 however should not be changed. This is fixed for the model.
  • Downsampler transformer has a factor of 1 - which works best for actions close to ~1s per repetition. It can tolerate varying speed to some extent. If your targeted action is typically much slower, you may set the factor to 2 or another number, though this may increase the counter delay too. Unfortunately, you have to manually set it in the sample app for now.
  • You are also free to change some of the other logics in the sample app, such as how uiCount is rounded and reset, etc.
Question page
Correct me if I’m wrong. 1. Virtual HIIT fitness coach is not a good app idea for today. Action classifier can’t classify actions fast and accurate enough on mobile devices today? 2. The model was trained using 30 fps and a prediction windows size of 90 frames under the assumption that each human body action lasts about 3 seconds?
  1. Action classifier is a model template that needs to be trained, so it is good with fitness actions it was trained with, such as jumping jacks, squats, and some HIIT actions. Depending on your specific needs, we may further talk about topics such as how fast the actions could be and how accurate the results are. Some resources are also here:
  2. This model is a separate model for counting actions (not the action classifier from WWDC20). It is class-agnostic, exposed via our API, and does not need to be trained. It was trained with 30fps videos, and the window size is 90 frames. But this is a completely different model, and its window size of 90 isn't the same concept as the action classifier's window size. Within these 90 frames, multiple completed actions are OK (e.g., best with 2~4 action repetitions captured within the window). If you have videos or a camera feed other than 30fps, you could choose to downsample using the Downsampler transformer. If your targeted actions are 30fps but quite slow, such as push-ups, you could choose to do downsampling too.
Question page
As far as I know, multi-label image classification is not possible with Create ML. Is it possible with Create ML Components to create a multi-label classifier?

One option is to implement your own custom estimator using a framework like MPSGraph. To simplify the task (and data required), you may want to explore training the classifier on the features produced by a feature extractor, such as ImageFeaturePrint.

Question page
Data Annotation
When building an Object Detection model, do you have any specific tools or recommendations on how to best annotate objects? There are a lot of tools out there, but they often feel cumbersome to use in comparison to the ease of Create ML.

Check out

https://developer.apple.com/documentation/createml/building-an-object-detector-data-source

for documentation on how to structure annotated data for object detection. Apple doesn't have a tool we can provide to create the annotations, and I agree it can be a bit of a cumbersome process. Hope that helps!


Some options suggested by non-Apple developers:

  • Roboflow - though if their public plan doesn’t work for you, then it is quite expensive
  • Label Studio - it’s open source
  • CVAT - there is a hosted, free version available at cvat.org, but you can self-host
Question page
Data Augmentation
Can you use any iOS graphics API for data augmentation?

Yes, you are free to use any API. The only requirement is that it produces a CIImage.

Question page
In the "Get to know Create ML Components" session around the 14m mark, the presenter mentioned that the augmentation applied is only used during training, not validation. Is that really true, given that it was just applied using flatMap() to the combined dataset in the code shown? It is not what I would expect based on reading the code.

It applies for training and validation data, but only at training time, not when doing predictions.

Question page
Data Preprocessing
What if my data is exponential, would that need a quadratic regressor? In one of the videos the data was parabolic and you normalized it. What’s going on here?

So this is a technique from classical statistics, but basically a common approach is to take a raw set of data and normalize it before applying a fit function. You could run data with an exponential distribution through a log transform, for example. One of the risks if you don't is that the leverage from a few points can be very high and skew your model. There's tradeoffs to this approach and many different techniques that can help you find a good model, but data normalization is often a helpful technique.

I think in general it can be helpful in data preparation as one tool that's available to use. But there's an entire field on this and I don't want to oversimplify.
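
As a tiny illustration of the idea in plain Swift (not tied to any particular framework):

import Foundation

// Raw values with an exponential-looking spread.
let raw: [Double] = [1.2, 3.5, 10.0, 45.7, 120.3, 400.9]

// A log transform compresses the long tail so a few large points
// don't dominate the fit.
let logged = raw.map { log($0) }

// Optionally standardize to zero mean and unit variance before fitting.
let mean = logged.reduce(0, +) / Double(logged.count)
let variance = logged.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / Double(logged.count)
let standardized = logged.map { ($0 - mean) / variance.squareRoot() }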

Question page
Differential Privacy
Is there a way create a federated learning/training solution. I'm planning to create on-device training and since the data is private I'd want to update the model while respecting user privacy and have all users benefit from a new model I can redistribute in updates. Is there a way to achieve this? Especially while respecting user data privacy.

Apple doesn't offer an out-of-box solution for federated learning at the moment. The differential privacy team at Apple published this article to discuss how to think about this problem:

https://machinelearning.apple.com/research/learning-with-privacy-at-scale

Question page
Document Segmentation
Vision question: does a VNRectangleObservation contain information about the shape's full outline? For example: a document scanner that needs to fully extract a page from its background. VNDetectRectanglesRequest will provide the position of each corner of the page, which allows us to clip the shape assuming the page is flat. But if the paper is curled, we can end up with bits of background in our cropped image. Is there a way to trace an accurate outline for imperfect rectangles?

For imperfect documents, you might want to look at VNDetectDocumentSegmentationRequest and they combine it with a contour detection on the globalSegmentationMask that you get as a result.

https://developer.apple.com/documentation/vision/vndetectedobjectobservation/3798796-globalsegmentationmask

It is a low-res pixel buffer that represents the shape of the detected document, where each pixel represents the confidence of being or not being part of the document.

Document segmentation is ML based and trained on all kinds of documents, labels, papers, etc. Rectangle detection is a traditional algorithm that works on edges that intersect to form a quad.
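
A rough sketch of combining the two (error handling omitted; it is worth double-checking the exact result types in the Vision documentation):

import Vision

// `cgImage` is the photo you want to process.
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try handler.perform([request])

if let document = request.results?.first {
    // Quad corners of the detected document.
    let corners = [document.topLeft, document.topRight,
                   document.bottomRight, document.bottomLeft]

    // Low-resolution confidence mask of the document's actual shape,
    // useful for contour detection on curled pages.
    let maskPixelBuffer = document.globalSegmentationMask?.pixelBuffer
}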

Question page
Drawing Classifier
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
Face Recognition
I'm part of a team that is building an app that is wishing to identify and recognise faces in a collection of photos. At the moment I've had success with Photos/Vision framework to find faces in photos and isolate them, but we're currently then sending those faces to AWS Amazon Rekognition service to help compare the face to a set of others and associate them to an existing face, or create a new face model. If I wanted to move this type of modeling onto the device itself (rather going through a network request to a 3rd party service), could you possibly guide me where to start? I'm assuming I could do the same thing locally on device using Apple frameworks?

We do not offer on-device face recognition solutions.

Generally speaking you would need to either find (or train, if you have the data and the know-how) a face recognition model, which could then be run on-device through Core ML once converted into that format. Often such models return some descriptor, which can be compared to other similar descriptors to provide a distance. How best to measure that distance is often tied in to how the face recognition model was trained.

You may file a Feedback Assistant request if you'd like Apple to offer face recognition in the future.

Question page
Federated Learning
Is there a way create a federated learning/training solution. I'm planning to create on-device training and since the data is private I'd want to update the model while respecting user privacy and have all users benefit from a new model I can redistribute in updates. Is there a way to achieve this? Especially while respecting user data privacy.

Apple doesn't offer an out-of-box solution for federated learning at the moment. The differential privacy team at Apple published this article to discuss how to think about this problem:

https://machinelearning.apple.com/research/learning-with-privacy-at-scale

Question page
GPU acceleration and federated learning are two very appealing approaches for large scale training (or even training over the edge using multiple mobile devices). Is there some special provision in the MPSGraphs framework to enable/enhance such functionality?

MPSGraph should run just fine with iOS and iPadOS. There are no special pre-built functions that achieve techniques like PFL, but using for example the random-number generators provided by MPSGraph, one should be able to generate these operations from basic building-blocks.

Then as long as you can aggregate the gradients or other weight-updates across the network (something outside the scope of MPSGraph) you should be able to do this.

But again quite a bit of manual work is needed.

Question page
Feedback
When training an Object Detection Model in Create ML, is there a way to take automatic snapshots to ensure that there is a record of how the model is performing throughout iterations as opposed to having to take manual snapshots to do so?

In the Create ML app this is a manual process. You can also set options on the project to automatically snapshot on pause or train more. Having an automatic option is a great feature request. Please consider filing a request using feedback assistant.

You do have the option to use the Create ML framework directly in Swift and set your own cadence of checkpointing.

MLObjectDetector

Question page
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
We can separate people and background on photo, for example to create stickers, using VNGeneratePersonSegmentationRequest. What about animals, objects and etc like you did it in iOS 16? I mean feature that I can long-press at any object on photo to copy/paste it. Do we have ready API for that?

Yes, you can use VNGeneratePersonSegmentationRequest for people. There currently is no equivalent for animals or other objects via the Vision APIs.

Please consider filing a feature request or feedback via https://feedbackassistant.apple.com or bringing more questions to the Q&A Vision digital lounge from 2-3pm on Thursday

Question page
Why do older devices use more memory when loading MLModel? A model used about 600MB of memory on the iPhone 12 Pro, but on the iPhone 11 Pro the app crashed over 1.2GB of memory.

We need to look at the model in question. Please submit a problem report through Feedback Assistant.

We are not aware of such a memory usage issue in general between these hardware generations.

Question page
We're facing some challenges on converting our AI models to Core ML, some operations 'state-of-the-art' aren't fully supported and we're considering running it in a different approach. Would it be feasible to have a model in C++ and leverage the GPU power of the devices, if yes... how? Is there any workaround for torch.stft and torch.istft?

A composite operator may help you convert these operations:

https://coremltools.readme.io/docs/composite-operators

In some cases, you can also supply a custom operator but to leverage the full Core ML stack its best to see if you can represent the functionality as a composite op first

Are the FFT operations in the beginning / pre processing stage of the model? If so, then you can accelerate the rest of the model by converting it to Core ML, and implementing the FFT operation using BNNS or Metal.

In any case, submitting a feedback request with your use case would be great.

Question page
Do you know if it is possible to have a layer-wise execution time profiling with XCode 14 for the operations that run on the Neural Engine or GPU?

The new performance reports give a layer-wise break down of compute unit support, but not execution time. Please consider filing some feedback at:

http://feedbackassistant.apple.com

Question page
Core ML Performance Report is great, but I can't find per-layer performance stats to find bottlenecks in our model.

The new performance reports offer per-layer compute unit support, but not per-layer timings. One further step you can take to find bottlenecks in the model is to press the "Open in Instruments" button where you can see further details in the Core ML Instrument. This won't offer per layer timing details, but it can help find bottlenecks related to data operations and compute unit changes.

Question page
There are a lot of CoreMLCompiler versions throughout the Xcode history. Some break inference (e.g. some of the Xcode 13 coremlcompilers broke iOS 14 runtime). Is there a way to diagnose these errors without compiling under every compiler and running on all iOS devices?) And is there a known stable version of a compiler?

Maintaining backward compatibility is really important for us. We would like to understand this better. Could you file a bug report on:

feedbackassistant.apple.com?

If it works for you, could you please set up a 1:1 lab with Apple engineers so we can understand the issue better?

Question page
In addition to the built-in classifiers and regressors, is it possible to specify a neural network with a custom structure (by specifying the layers) and use that for training and inference?

You can use a FullyConnectedNetworkClassifier or regressor, which support specifying the number of hidden layers and the number of hidden units in each layer.

There are ReLUs on every hidden layer. Other activation functions are not available.
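
As a hedged sketch (the generic parameters and initializer labels below are assumptions based on the documentation, so verify them in Xcode):

import CreateMLComponents

// A fully connected network classifier with two hidden layers of 64 and 32 units.
// `labels:` and `hiddenUnitCounts:` are assumed parameter names.
let classifier = FullyConnectedNetworkClassifier<Float, String>(
    labels: ["cat", "dog", "bird"],
    hiddenUnitCounts: [64, 32]
)
// It is then trained like any other estimator, e.g. via fitted(to:).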

Other network architectures would require using Metal or Accelerate

If you have a use case please file a ticket in Feedback Assistant.

Question page
When running model.predict(), I get "Error in declaring network." What does that mean?

This indicates that there is an error setting up the model. Can you please file a feedback report with code to reproduce the issue? Also can you verify if this issue reproduces with different compute unit options used when loading the model?

Question page
Is there any chance you will add support for at least 4 channels no-copy buffers?

One way is to stack each channel on the height axis and use that big pixel buffer as the backing of an MLMultiArray. So:

// size = (width, height * 4)
// format = kCVPixelFormatType_OneComponent16Half
let backingPixelBuffer = ...
let multiArray = MLMultiArray(pixelBuffer: backingPixelBuffer, 
                              shape: [4, height, width])

This won’t work if your image representation is so called “packed” format, where each channel is interleaved in the frame buffer. We appreciate your feedback assistant report with a use case if that’s what you are looking for.

Question page
How do I handle situations where older ANE versions might not support certain layers and it will result in cpuAndNeuralEngine config being extremely slower on some devices?

MLComputeUnits.all is the default option and we recommend using that in most of the cases.

Core ML tries to optimize for latency while utilizing all the available compute units. MLComputeUnits.cpuAndNeuralEngine is helpful when your app is using the GPU for pre- or post-processing and would like Core ML to not dispatch the model on the GPU. Other than that, MLComputeUnits.cpuAndNeuralEngine behaves very similarly to MLComputeUnits.all.

If you have a model that is running much slower on certain devices, we recommend filing some feedback at http://feedbackassistant.apple.com with the model and specific device(s) included.

Question page
With VNDocumentCameraViewController, is it possible to limit it to just one scan, so that the user doesn't have to press "Save" at the end?

No, sorry. Would appreciate a Feedback for an enhancement request, though

Question page
Last year you introduced the VNGeneratePersonSegmentationRequest. I know you can't comment on future plans, but it would be amazing if the new pet / object segmentation of iOS 16 was available to developers

It is always good to file feedback and explain what you are looking for.

Question page
Live Text seems to be added to UIImageView via `.addInteraction(<ImageAnalysisInteraction>)`. Is there a way to add this interaction to SwiftUI's `Image`?

I don’t believe so. You can probably wrap a UIImageView in a UIViewRepresentable - but of course it wouldn't be a SwiftUI Image anymore. I think we might have sample code that does something similar in the State of the Union donut app

Here's the relevant part in the Platforms State of the Union video.

But yeah, we definitely need SwiftUI support for the new VisionKit APIs 🙂 feedbacks may help
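
In the meantime, a rough UIViewRepresentable wrapper might look something like the sketch below. This is not Apple sample code; the ImageAnalyzer calls reflect the VisionKit API as documented this year, so double-check the details.

import SwiftUI
import UIKit
import VisionKit

struct LiveTextImageView: UIViewRepresentable {
    let image: UIImage

    private let analyzer = ImageAnalyzer()
    private let interaction = ImageAnalysisInteraction()

    func makeUIView(context: Context) -> UIImageView {
        let view = UIImageView(image: image)
        view.isUserInteractionEnabled = true
        view.addInteraction(interaction)
        return view
    }

    func updateUIView(_ uiView: UIImageView, context: Context) {
        Task {
            // Analyze for text and attach the result to the interaction.
            let configuration = ImageAnalyzer.Configuration([.text])
            if let analysis = try? await analyzer.analyze(image, configuration: configuration) {
                interaction.analysis = analysis
                interaction.preferredInteractionTypes = .automatic
            }
        }
    }
}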

Question page
I'm part of a team that is building an app that is wishing to identify and recognise faces in a collection of photos. At the moment I've had success with Photos/Vision framework to find faces in photos and isolate them, but we're currently then sending those faces to AWS Amazon Rekognition service to help compare the face to a set of others and associate them to an existing face, or create a new face model. If I wanted to move this type of modeling onto the device itself (rather going through a network request to a 3rd party service), could you possibly guide me where to start? I'm assuming I could do the same thing locally on device using Apple frameworks?

We do not offer on-device face recognition solutions.

Generally speaking you would need to either find (or train, if you have the data and the know-how) a face recognition model, which could then be run on-device through Core ML once converted into that format. Often such models return some descriptor, which can be compared to other similar descriptors to provide a distance. How best to measure that distance is often tied in to how the face recognition model was trained.

You may file a Feedback Assistant request if you'd like Apple to offer face recognition in the future.

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current Proto release is focused on functionality and we have not tuned the performance yet. Do look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Do update to latest and see if it still is giving bad utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the cpu? That hurts performance.
  4. Which OS is it: 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
Will the source code for the project in "Explore the machine learning development experience" be available? It would be helpful to be able to dig in and change some things up to really understand the flow.

The code is not available at this time. Your interest and request are appreciated and noted 🙂

Explore the machine learning development experience

Question page
Is it possible to dispatch a Core ML inference evaluation as part of a display or compute shader pipeline. Or do I need to wait for the CPU to be informed the frame has been rendered before dispatching from the CPU. Best of all would be if it could run on the ANE so that the GPU is free to work on the next frame.

If the output of GPU is in a IOSurface (in Float16 format), you can feed that to Core ML and let ANE work on it directly without any copies but, CPU does get triggered today for synchronizing these two computations.

Core ML doesn’t support MTLSharedEvent, if that’s what’s implied here.

Would you be able to file a feature request on feedbackassistant.apple.com with a few more details about your use case and maybe some sample code showing how you want to accomplish this? That would really help us push the API in the right direction.

Question page
Is it possible to run a Core ML model in the cloud/on Linux? We are using a Core ML model to power privacy-preserving, on-device features. But we want to offer a web-based demo to potential users, since downloading an app can be higher friction than just using a website.

No, it is not supported. It is an interesting use case. A feedback assistant report will be much appreciated!

Question page
GAN
Can a GAN solution be made from ML Components?

If you already have a GAN model, then you can use it in Create ML Components using an adaptor.

Create ML Components does not offer out-of-box support for any specific GAN model, but you are able to make a transformer to use a GAN under the hood.

If you don't already have a model, I’d suggest starting by checking the models you are interested in (wherever they are published) and seeing whether they can be converted to MLModels, if you want to perform inference (deploy) on Apple hardware. Create ML and Create ML Components do not offer out-of-box support for such models; they are meant to give you the flexibility to do so.

Question page
Is it possible to convert a PyTorch Text->Image model, such a VQGAN, to CoreML?

You can try using coremltools - https://coremltools.readme.io/docs/pytorch-conversion

I was trying to find something more specific for you, but couldn’t. Personally I haven’t tried VQGAN, but it seems like CLIP model can be converted. Here’s an issue that has been resolved regarding CLIP.

Question page
Does coremltools support conversion of PyTorch Text-> Image models like CLIP? VQGAN?

Both these models (CLIP and VQGAN) are based on CNNs and transformer architectures, both of which should be supported.

In fact here is a resolved issue of a CLIP model conversion.

Note that, depending on the details, you may have to perform the pre-processing of the text input transformation to a tensor representation outside the PyTorch model given to the Core ML Tools convert API. The conversion operates on PyTorch models with tensor in tensor out interface.

I’d say just give the converter a try, and please take a look at some of the examples on the doc page and if you run into issues, post on the Github repo.

Question page
General
What are some ways to apply Create ML and Core ML to everyday tasks? What are the best tasks for Core ML models?

Really, the limit’s your imagination! But machine learning methods, generally, work well if you have:

  • a well-defined objective (in other words, a clear, unambiguous criterion which tells you whether a classification is the right one or how far an estimate is from its true value)
  • enough training data.

That’s true of a lot of different problems! Image, audio and text classification are all things which are applicable to a lot of real-world problems.

The key way to take advantage of ML is identifying problems where the goal can be clearly defined but the method is tricky. It’s not worth training a ML model to say whether something is red or blue – the average pixel value tells you that. But determining the class of an object (or reading text out of an image, or…) – these are problems where you can define what a success is unambiguously, but coming up with heuristics for the problem is harder. That’s where you’ll get most “bang for the buck” from ML methods!

Question page
Is there a document from that talks about how ML development works with Apple products and what is needed to get started?

A good starting point is to check out an overview of Apple’s ML focused development tools here:

https://developer.apple.com/machine-learning/

There are also some past WWDC videos which show you an example journey from idea to implementation such as this talk: Creating Great Apps Using Core ML and ARKit.

I highly recommend checking out this session on Friday: Explore the machine learning development experience

Question page
Hello! You mention searching for models in various "specialized" websites and such... do you have favorite places you've gone to find models?

Besides searching Github, you can use these sites:

Question page
In "Explore the machine learning development experience", you mentioned re-training a few candidate replacement models before model integration. What's your process for deciding how many to try?

I tried architectures from two other scientific publications too. But then I decided to rework the architecture of the model I used in the session a bit and went with that.

The process can be different from model to model.

Explore the machine learning development experience

Question page
Will the source code for the project in "Explore the machine learning development experience" be available? It would be helpful to be able to dig in and change some things up to really understand the flow.

The code is not available at this time. Your interest and request are appreciated and noted 🙂

Explore the machine learning development experience

Question page
Image Classifier
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development.

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
I've created a small image classification model using Create ML with around 350 image labels. However, for the iOS app I'm making that could scale to over 100,000 labels (and likely much more) - each with over a hundred images for training/testing. Is there any way to scale to that level using Create ML? I've started teaching myself TensorFlow and researching various cloud services like Google Colab for the training because I think I'll need to go that route... and then convert that to Core ML. I'd appreciate any thoughts / recommendations.

Wow, 100,000 classes! That’s a hard problem – in fact, it’s something which would be at the research cutting edge. Is there any structure to your classes? That might be something you could exploit.

It’s definitely worth having a look at the literature here — these tend to be called “extreme classification tasks”

For instance http://manikvarma.org/pubs/bengio19.pdf

This is a review which covers some of the problems in this area. However, if you have a natural hierarchy to your labels, you might consider having a hierarchy of classifiers. Let’s say we’re talking about animals; you might have a classifier from “animal” to “bird”, “mammal”, “reptile”, etc etc etc, then from “bird” to bird species

That way each classifier is only predicting among, say, 1,000 classes – which is a more tractable problem

If there is no class hierarchy, and you're looking to recognize individual identifiable objects, you may want to check out the algorithms behind similar-image search, where you calculate an embedding for an image, then find the N nearest embeddings from the known images and derive your class/object from them.

Question page
With the initializers for MLImageClassifier.ModelParameters being deprecated, what is the easiest way of increasing the iterations being performed?

MLImageClassifier.ModelParameters still has the

init(validation: ValidationData,
     maxIterations: Int,
     augmentation: ImageAugmentationOptions, 
     algorithm: ModelAlgorithmType)

initializer, which you can use to set the maxIterations along with other parameters that you want.

There are a few other initializers that were deprecated so if you are trying to use one of those old ones to set the maxIterations, you will get this warning.

Note that this particular initializer has no default value for augmentation . So, when you call this initializer you need to pass the augmentation parameter as well. If you don't want to set any augmentation, you can just pass a value of 0 for this parameter. If you do not specify augmentation in the init at all, one of the old initializers will be used and hence you will get the deprecated warning.
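
To make that concrete, here is a hedged sketch of calling the non-deprecated initializer; the validation strategy, iteration count, algorithm choice and training directory are illustrative assumptions, not required values.

import CreateML
import Foundation

// All four parameters are passed explicitly; [] means no augmentation.
let parameters = MLImageClassifier.ModelParameters(
    validation: .split(strategy: .automatic),   // assumption: adjust to your own validation setup
    maxIterations: 50,
    augmentation: [],
    algorithm: .transferLearning(featureExtractor: .scenePrint(revision: 1),
                                 classifier: .logisticRegressor)
)

// "Training" is a placeholder directory of labeled subfolders.
let classifier = try MLImageClassifier(
    trainingData: .labeledDirectories(at: URL(fileURLWithPath: "Training")),
    parameters: parameters
)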

Question page
I am looking to detect or classify a jersey number from a sporting event such as hockey in a video. I have tried VNRecognizeTextRequest but do not get good results. Is there a better way to do such a task? Would I be better off creating my own model for this?

The results should have improved a bit using Revision 3 of VNRecognizeTextRequest. You could try that first.

Or you could train a custom classifier, but that requires loads of images to get good results.

When text gets deformed on fabric or obscured, it gets very difficult to read.
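
Opting in to the newer revision is a one-line change; a minimal sketch, where the source image is a placeholder for a frame from your video:

import Vision

let request = VNRecognizeTextRequest { request, _ in
    let observations = request.results as? [VNRecognizedTextObservation] ?? []
    let strings = observations.compactMap { $0.topCandidates(1).first?.string }
    print(strings)   // candidate jersey numbers and other text
}
request.revision = VNRecognizeTextRequestRevision3
request.recognitionLevel = .accurate

// frameImage is a hypothetical CGImage extracted from the video.
let handler = VNImageRequestHandler(cgImage: frameImage, options: [:])
try handler.perform([request])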

Question page
As far as I know, multi-label image classification is not possible with Create ML. Is it possible with Create ML Components to create a multi-label classifier?

One option is to implement your own custom estimator using a framework like MPSGraph. To simplify the task (and data required), you may want to explore training the classifier on the features produced by a feature extractor, such as ImageFeaturePrint.

Question page
Image Sharpening
What image-sharpening torch model did you use in the "Optimize your Core ML usage" talk?

For the image sharpening model, we started with this Super Resolution PyTorch model and then made some custom modifications to fit our use case before converting to Core ML format using coremltools.

Question page
iOS
Can I train a tabular classifier model on iOS?

You absolutely can! Have a look at last year's session on building dynamic apps:

Build dynamic iOS apps with the Create ML framework (WWDC 2021)

Also, be sure to checkout the Get to know Create ML Components session that dropped today. There, Alejandro walks through building a tabular regressor all in Swift.

Question page
Linux
Is it possible to run a Core ML model in the cloud/on Linux? We are using a Core ML model to power privacy-preserving, on-device features. But we want to offer a web-based demo to potential users, since downloading an app can be higher friction than just using a website.

No, it is not supported. It is an interesting use case. A feedback assistant report will be much appreciated!

Question page
Live Text
As far as I understand, extracting text from images is not possible for Arabic language, would it be possible to use Create ML to achieve the same effect that is built in to extract Arabic text from images and documents?

You are right about Arabic support. While Apple announced more language support for Live Text this year, Arabic was not one of them. The complete list is here:

https://www.apple.com/ios/feature-availability/#live-text-live-text

It's not possible to extend Live Text to add additional user languages at this time.

To build your own system would require solving multiple ML problems, including locating text in the image, decomposing it into graphemes (characters), and to be robust it should probably include some sort of spelling/grammar layer to reduce transcription errors.

Building a complete solution like this is beyond what Create ML is designed for today.

Question page
Can the live text selected automatically (in a designated area) simply select all without user highlight?

Sorry, select all isn't an option currently. The user must manually select the text first. The only thing you can do is reset the selection if one exists.

Question page
We can use Shipment Tracking Number, URL as a source for live text. Can we define our own source for the live text? Let’s say I wanna add detection of new couriers other than FedEx or UPS.

For the DataScannerViewController? Sorry, right now there is just the one option for shipment tracking numbers, and it's whatever carrier we're able to detect.

Question page
What is your recommendation for using DataScannerViewController to detect money/currency values? DataScannerViewController.TextContentType does not appear to support money, currencies, or generic numbers (see FB10139138). The iOS Camera app supports money/currency detection in iOS 16. What is the best practice for me to implement a similar feature in my app? Should I recognize all text and then parse each recognized text item myself to determine if the string value contains number or currency amount?

Ah, good enhancement request. Currently we're not supporting currency.

You might be able to detect the presence of currency with UIDataDetectors, but you won't be able to highlight them.

Another option is to use capturePhoto() to take a still then use the Live Text APIs. That'll highlight all the data detector elements, not just money.

You could also add some business logic on top of Vision text recognition to dial in on numbers only, specific (relative) text size or even position of the text in the rectangle of the currency.

Question page
Is live text powered by Vision APIs?

Live text is using Vision for its recognition work.

Question page
Live Text seems to be added to UIImageView via `.addInteraction(<ImageAnalysisInteraction>)`. Is there a way to add this interaction to SwiftUI's `Image`?

I don’t believe so. You can probably wrap a UIImageView in a UIViewRepresentable - but of course it wouldn’t be a SwiftUI image anymore. I think we might have sample code that does something similar in the State of the Union donut app.

Here's the relevant part in the Platforms State of the Union video.

But yeah, we definitely need SwiftUI support for the new VisionKit APIs 🙂 feedbacks may help

Question page
Live Video
There was a great WWDC video back in 2018 titled Vision with Core ML. The example app used live video capture to feed an image through the model using scene stability. There is sample code out there for UIKit, but always wanted to try and re-make using SwiftUI as a starting point. Any tips or pointers on making an image recognition app with live video capture using SwiftUI as a starting point ?

You have a couple choices here. You can use the techniques you already know, and use UIViewRepresentable to import the UIViews into your app. That will still work really well!

I've actually done exactly that using UIViewControllerRepresentable in the app I'm working on, and used a modified version of the VisionViewController which I think came from that video's sample code. It works perfectly... the entire app is SwiftUI except for the VisionViewController part.

Here's a link to the original sample code that has the VisionViewController (click the download button from that page): Training a Create ML Model to Classify Flowers. Unfortunately I don't have a publicly available version that shows where I've used that with UIViewControllerRepresentable.

Alternatively, you can display the camera feed by sending the image from each frame into a SwiftUI Image. The new Create ML Components feature actually includes a VideoReader.readCamera method which provides an asynchronous stream of image buffers, which is a great way to get started with this approach. Alternatively you can use your existing AVCaptureDevice logic and delegate methods to provide a series of images. You can see an example of this approach in the rep counting demo app which will be available soon as part of this year's WWDC session What's new in Create ML

Recently I refactored some UIKit code into SwiftUI and found that the ViewController could be relatively easily transformed into an ObservableObject class, changing @IBOutlets into @Published properties.

I've been experimenting with using AVCaptureVideoPreviewLayer too. Using a preview layer is likely to be a very CPU / memory efficient way to do it. I don't yet have a clear picture of the efficiency of using SwiftUI views to run the preview directly in the way I proposed, so while it can be a lot of fun to do it purely in SwiftUI it may not be the best choice just yet.

Question page
Memory
Why do older devices use more memory when loading MLModel? A model used about 600MB of memory on the iPhone 12 Pro, but on the iPhone 11 Pro the app crashed over 1.2GB of memory.

We need to look at the model in question. Please submit a problem report through Feedback Assistant.

We are not aware of such a memory usage issue in general between those devices.

Question page
Metal
When will Create ML support Neural Networks?

Just to clarify, it does support neural networks. For instance FullyConnectedNetworkClassifier.

But if you wanted to create a custom network you would need to use Metal or Accelerate.

Question page
The MetalFX team has presented a very nice (classical) method for video upscaling. What is the potential of using MPS to achieve machine learning upscaling?

MPSGraph supports most common neural-network machine learning layers and operations so you should be able to create an upscaling network from the basic components, but MPSGraph doesn't have prebuilt graphs or networks so you would need to investigate and research the network architecture yourself, train it (using MPSGraph or other training frameworks) and deploy on MPSGraph.

One benefit of using MPSGraph is that you can pretty easily incorporate other Metal kernels (for example MPS image processing kernels or your own kernels) and encode them to the same Metal CommandQueue (or MPSCommandBuffer) to achieve low-latency, often zero-copy execution between the pre/post-processing kernels and the MPSGraph segment(s).

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current Proto release is focused on functionality and we have not tuned the performance yet. Do look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Do update to latest and see if it still is giving bad utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the cpu? That hurts performance.
  4. Which OS is it: 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
GPU acceleration and federated learning are two very appealing approaches for large scale training (or even training over the edge using multiple mobile devices). Is there some special provision in the MPSGraphs framework to enable/enhance such functionality?

MPSGraph should run just fine with iOS and iPadOS. There are no special pre-built functions that achieve techniques like PFL, but using for example the random-number generators provided by MPSGraph, one should be able to generate these operations from basic building-blocks.

Then as long as you can aggregate the gradients or other weight-updates across the network (something outside the scope of MPSGraph) you should be able to do this.

But again quite a bit of manual work is needed.

Question page
In "Accelerate machine learning with Metal" Drhuva referenced a new sample code for NeRFs at 14:08. But I can't find it anywhere :( P.S. Yaaay, NeRFs! :)
While using MPS backend in PyTorch, I've found out that there is no way to select a GPU. This feature would be really beneficial while running Mac Pro with multiple GPU's.

We currently don’t have multi-GPU support.

Question page
Is there custom operation support for PyTorch?

To learn more about GPU acceleration for PyTorch and TensorFlow, please refer to Accelerate machine learning with Metal

Specifically, PyTorch is open sourced, so you can leverage this to implement custom operations in Metal. Custom ops are also supported for TensorFlow as outlined in the session.

Question page
MLModel
Does an MLModel need to be adapted in any way to support predicting into buffers given with MLPredictionOptions.outputBackings?

MLModel does not need to be adapted in any way to accept output backings. MLPredictionOptions.outputBackings can be used to provide either a CVPixelBuffer or MLMultiArray depending on the output feature value type. When a CVPixelBuffer is provided as output backing for prediction, please ensure that the base address is not locked for read / write. Please check out Optimize your Core ML usage session tomorrow.
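
Here is a minimal sketch of the output-backing flow; the model path, the 512×512 shape and the "output" feature name are placeholders for your own model's details.

import CoreML
import CoreVideo

// "Model.mlmodelc" is a hypothetical compiled model.
let model = try MLModel(contentsOf: URL(fileURLWithPath: "Model.mlmodelc"))

// Create an IOSurface-backed, Float16 pixel buffer and wrap it as an MLMultiArray (no copy).
var pixelBuffer: CVPixelBuffer?
let attributes = [kCVPixelBufferIOSurfacePropertiesKey: [:] as CFDictionary] as CFDictionary
CVPixelBufferCreate(kCFAllocatorDefault, 512, 512,
                    kCVPixelFormatType_OneComponent16Half, attributes, &pixelBuffer)
let backing = MLMultiArray(pixelBuffer: pixelBuffer!, shape: [1, 512, 512])

let options = MLPredictionOptions()
options.outputBackings = ["output": backing]   // leave the buffer's base address unlocked

// Fill in your real input features here.
let input = try MLDictionaryFeatureProvider(dictionary: [:])
let result = try model.prediction(from: input, options: options)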

Question page
Models
Hello! You mention searching for models in various "specialized" websites and such... do you have favorite places you've gone to find models?

Besides searching Github, you can use these sites:

Question page
Neural Engine
I've noticed you have a new way to store model weights in sparse form (which is a great addition) and am wondering it there's some fundamental blocker to using sparse operations at inference time too?

The sparsity is leveraged during execution on certain compute units such as the Neural Engine.

This is not generally the case for any model which contains sparse matrices. It is best to use the sparse encoding explicitly.

Question page
Does training (using the fitted or update methods) take advantage of the Neural Engine, for example when training a fully connected classifier?

Not on the Neural Engine, but training of a fully connected network is optimized to run on the best available compute unit.

Question page
Are float16 MLMultiArray no-copy also? Will there be a copy event if I specify user-allocated MLMultiArray with float16 data into outputBackings?

The Neural Engine is unable to write into MLMultiArray directly unless it was initialized with an IOSurface-backed pixel buffer.

MLMultiArray.init(pixelBuffer:shape:)

Question page
Hi, I'm excited to see more information about optimizing recent models for Core ML including the `ane_transformers` repo. If I wanted to optimize eg CLIP for ANE, should I use code from that repo, or just try to take recommendations from the case study?

Yes I think using the code from the Apple Neural Engine (ANE) Transformers repo at the end of the Deploying Transformers on the Apple Neural Engine article is the best way to get started, and definitely follow along the recommendations of the article as well. Sounds like you are already on the right track!

The default conversion should be quite efficient as well for the neural engine (NE). With the new performance tab in Xcode 14, you will see whether the model is already neural engine resident or not.

There are details in the article on some of the changes specific to distilbert, which may or may not be required for the transformer architecture in CLIP.

There are more details to be found in the A Multi-Task Neural Architecture for On-Device Scene Analysis article as well, hope this helps.

In any case if you find any inefficiencies after conversion, feel free to share with us via a feedback request. We are constantly adding new converter and NE compiler optimizations to automatically detect patterns and map them efficiently to NE, so such feedback is very valuable!

Question page
Does Core ML benefit from two ANE's in M1 Ultra?

Yes, when you are using multiple models or batch inference.

Question page
Neural Network
When will Create ML support Neural Networks?

Just to clarify, it does support neural networks. For instance FullyConnectedNetworkClassifier.

But if you wanted to create a custom network you would need to use Metal or Accelerate.

Question page
In addition to the built-in classifiers and regressors, is it possible to specify a neural network with a custom structure (by specifying the layers) and use that for training and inference?

You can use a FullyConnectedNetworkClassifier or regressor, which support specifying the number of hidden layers and the number of hidden units in each layer.

There are ReLUs on every hidden layer. Other activations functions are not available.

Other network architectures would require using Metal or Accelerate

If you have a use case please file a ticket in Feedback Assistant.

Question page
NLP
I want to start learning ML and Core ML, and I have thought of a problem space that may be interesting that could use further exploration. I know NLP depends on extraordinarily large data sets, but I'm wondering about the utility of training a model on a constructed language with a much smaller data set. The one I have in mind has a very small official dictionary (slightly more than 100 official words), and rather simple grammar rules. Are there resources you would recommend for exploring this specific application of ML, or any pitfalls I might want to keep in mind?

This depends somewhat on what sort of tasks and models you are interested in.

For example, for a classification task, the maxent classifier available through Create ML is not language-dependent. It should be able to take on classification tasks in an artificial language of this sort. Gazetteers are language-independent, so they would still be usable.

Our built-in embeddings are language-dependent, so they would not be of use here.

If you want to train your own embedding or language model using open-source tools, that probably would still require significant amounts of data, but perhaps not as much as with natural languages.

Language modeling techniques have recently been applied with some success to programming languages. If your rules are similar to the syntax rules of programming languages, you might consider using the sorts of parsing tools that are used for them, but that is really a different area than NLP.

Question page
Is it appropriate to try to use word embeddings to match long-form text up to single worded categories? For example, figuring out the distance between `"exercise"` and `"Today I decided to ride my bike to the store. I needed to get a workout in."` I'd like to match sentences and paragraphs up to to tags.

The most robust approach to this sort of categorization would be to pick a set of categories in advance, collect training data, and train a classifier to classify sentences according to these categories.

If you need to handle words outside the originally chosen set of categories, you could then use word embeddings to find an existing category similar to the entered word.

If you aren't able to train a model, things get a bit trickier. You can use tools such as part-of-speech tagging to identify relevant words in a sentence, e.g. nouns in the example you give, and determine how similar those are to the word you are trying to match. You would then need to figure out some way to take scores for individual words and form a score for an entire sentence.

Overall I think you would get better results by training a classifier, although it would require more work in advance for training.
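
If you do go the embedding route for matching entered words to your categories, the built-in word embeddings in Natural Language make the distance check a few lines; the words here are just examples.

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Smaller distance means the words are more similar.
    let distance = embedding.distance(between: "exercise", and: "workout")
    print(distance)

    // Nearest-neighbor lookup can suggest the closest existing category.
    print(embedding.neighbors(for: "exercise", maximumCount: 5))
}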

Question page
Does Core ML have everything necessary to perform keyword extraction? How would you go about extracting keywords from articles of text?

Natural Language has a number of tools that can be useful in keyword extraction: tokenization, part-of-speech tagging, named entity recognition, gazetteers that could be used to identify stop words, and so on.

We don't provide an implementation of a specific keyword or keyphrase extraction algorithm, but there are algorithms that are sometimes used that take into account features such as frequency, co-occurrence statistics, TF-IDF, etc. that can be calculated from text that has been tokenized and processed using some of these tools.

Doing this fully unsupervised is a difficult task, though. You might be able to do better if you have some advance knowledge of the vocabulary that is relevant to the sort of text you will be working with.
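
As a starting point, the tokenization and part-of-speech pieces look roughly like this; keeping only nouns as candidate keywords is one simple heuristic, and any scoring (frequency, co-occurrence, TF-IDF) would be layered on top.

import NaturalLanguage

let text = "Natural Language has a number of tools that can be useful in keyword extraction."
let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text

var candidates: [String] = []
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: [.omitPunctuation, .omitWhitespace]) { tag, range in
    if tag == .noun {
        candidates.append(String(text[range]))
    }
    return true
}
print(candidates)   // candidate keywords before any scoring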

Question page
V3 extends VNRecognizeTextRequest with automaticallyDetectsLanguage - If I turn this on, how do I discover what language it decided to use?

Vision will not tell you which languages have been detected. The intent of this is to allow the client to give a "hint" to the algorithm.

If you already know the language up-front, it's best to specify that language explicitly, which allows the framework to target that language for better accuracy. If you do not, it's better to set automaticallyDetectsLanguage to true, which essentially is communicating to the framework "I don't know which language" and the framework will do its best to decode any language.

You can use NLLanguageRecognizer to detect the dominant language after the text has been extracted by Vision.

Usually a sentence is sufficient to identify language. You can pass in as much as you like, but the algorithm limits the amount of text it will consider. Less than maybe 5-10 words is challenging.

If you have some prior information as to what the language might be, you can also pass hints and/or constraints to NLLanguageRecognizer.
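
The language-identification step on the extracted text is short; a sketch, where recognizedText stands in for the joined strings returned by Vision and the hint values are purely illustrative:

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.languageHints = [.english: 0.7, .french: 0.3]   // optional prior knowledge
recognizer.processString(recognizedText)
if let language = recognizer.dominantLanguage {
    print("Dominant language: \(language.rawValue)")
}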

Question page
Normalization
What if my data is exponential, would that need a quadratic regressor? In one of the videos the data was parabolic and you normalized it. What’s going on here?

So this is a technique from classical statistics, but basically a common approach is to take a raw set of data and normalize it before applying a fit function. You could run data with an exponential distribution through a log transform, for example. One of the risks if you don't is that the leverage from a few points can be very high and skew your model. There's tradeoffs to this approach and many different techniques that can help you find a good model, but data normalization is often a helpful technique.

I think in general it can be helpful in data preparation as one tool that's available to use. But there's an entire field on this and I don't want to oversimplify.

Question page
Object Detector
When building an Object Detection model, do you have any specific tools or recommendations on how to best annotate objects? There are a lot of tools out there, but they often feel cumbersome to use in comparison to the ease of Create ML.

Check out

https://developer.apple.com/documentation/createml/building-an-object-detector-data-source

for documentation on how to structure annotated data for object detection. Apple doesn't have a tool we can provide to create the annotations, and I agree it can be a bit of a cumbersome process. Hope that helps!


Some options suggested by non-Apple developers:

  • Roboflow - though if their public plan doesn’t work for you, then it is quite expensive
  • Label Studio - it’s open source
  • CVAT - there is a hosted, free version available at cvat.org, but you can self-host

Question page
On-device Training
Is it possible to train the model generated by MLRecommender on device when new data is available?

MLRecommender does not support on-device training/updating. However, I suggest you check out our WWDC21 session below to build a personalized, recommendation-like experience into your app:

https://developer.apple.com/videos/play/wwdc2021/10037/

Question page
Operator
We're facing some challenges on converting our AI models to Core ML, some operations 'state-of-the-art' aren't fully supported and we're considering running it in a different approach. Would it be feasible to have a model in C++ and leverage the GPU power of the devices, if yes... how? Is there any workaround for torch.stft and torch.istft?

A composite operator may help you convert these operations:

https://coremltools.readme.io/docs/composite-operators

In some cases you can also supply a custom operator, but to leverage the full Core ML stack it's best to see if you can represent the functionality as a composite op first.

Are the FFT operations in the beginning / pre processing stage of the model? If so, then you can accelerate the rest of the model by converting it to Core ML, and implementing the FFT operation using BNNS or Metal.

In any case, submitting a feedback request with your use case would be great.

Question page
Optical Flow
May I ask if there is some functionality to enable to recognize the direction (arrow) of time in a video?

Interesting problem! You may find understanding the apparent motion of pixels in the image a useful input. You can compute optical flow using Vision VNGenerateOpticalFlowRequest
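
A sketch of setting the request up, assuming previousFrame and currentFrame are consecutive CVPixelBuffers from your video source:

import Vision

// The request is targeted at the previous frame and performed against the current one.
let request = VNGenerateOpticalFlowRequest(targetedCVPixelBuffer: previousFrame, options: [:])
request.computationAccuracy = .medium

let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame, options: [:])
try handler.perform([request])

if let flow = request.results?.first {
    // A two-channel float buffer: one (dx, dy) motion vector per pixel.
    let flowBuffer = flow.pixelBuffer
    print(CVPixelBufferGetWidth(flowBuffer), CVPixelBufferGetHeight(flowBuffer))
}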

Question page
Can optical flow be used in situations where more than one object is moving at the same time?

Yes, optical flow output is per pixel. Motion information will be returned for all parts of the image, and therefore for all moving objects in the scene.

Question page
Optimization
How can I ball park the runtime for linear regressor predictions? What I really want to know, how often I can call this?

I love this question, and we have a great session for you which dropped today.

Optimize your Core ML usage

Question page
Is there a way to use one IOSurface for both ANE and GPU work? Or access ANE IOSurface directly, and map it to MTLTexture by hand?

IOSurface-backed CVPixelBuffer with OneComponent16Half pixel format type can be shared with Neural Engine without copy. Likewise, MLMultiArray which is backed by the pixel buffer can also be shared without copy. (See MLMultiArray.init(pixelBuffer:shape:)).

For input features, using these objects in MLFeatureValue is enough to take advantage of the efficient data processing. When the output feature type matches the type above, Core ML automatically uses these objects in the output feature values as well.

For output features, you can even request the Neural Engine to write into your own buffer directly. See MLPredictionOptions.outputBackings.

You can create MTLTexture view into the pixel buffer with CVMetalTextureCache.
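
A sketch of that texture-view step, assuming pixelBuffer is the IOSurface-backed, OneComponent16Half buffer described above:

import CoreVideo
import Metal

let device = MTLCreateSystemDefaultDevice()!

var textureCache: CVMetalTextureCache?
CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)

let width = CVPixelBufferGetWidth(pixelBuffer)
let height = CVPixelBufferGetHeight(pixelBuffer)

var cvTexture: CVMetalTexture?
CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault, textureCache!, pixelBuffer, nil,
                                          .r16Float, width, height, 0, &cvTexture)

if let cvTexture = cvTexture, let texture = CVMetalTextureGetTexture(cvTexture) {
    // `texture` aliases the same memory Core ML reads from and writes into; no copy needed.
    print(texture.width, texture.height)
}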

Question page
Is there any chance you will add support for at least 4 channels no-copy buffers?

One way is to stack each channel on height axis and use that big pixel buffer as a backing of MLMultiArray. So:

// size = (width, height * 4)
// format = kCVPixelFormatType_OneComponent16Half
let backingPixelBuffer = ...
let multiArray = MLMultiArray(pixelBuffer: backingPixelBuffer, 
                              shape: [4, height, width])

This won’t work if your image representation uses a so-called “packed” format, where the channels are interleaved in the frame buffer. If that is what you are looking for, we would appreciate a Feedback Assistant report with your use case.

Question page
Are float16 MLMultiArray no-copy also? Will there be a copy event if I specify user-allocated MLMultiArray with float16 data into outputBackings?

The Neural Engine is unable to write into MLMultiArray directly unless it was initialized with an IOSurface-backed pixel buffer.

MLMultiArray.init(pixelBuffer:shape:)

Question page
Is direct writing to IOSurface-backed buffers compatible with flexible shape input/outputs?

As the application needs to prepare the output buffer for the outputBackings property, it must know the shape of the output before invoking the inference. Some flexible shape models (e.g. enumerated input shapes) meet this criterion, but so-called “dynamic shapes”, where the output shape is dynamically determined by the network, won’t work.

Question page
Is it possible to have flexible shape (enumerated) inputs (and therefore outputs) to be compatible with outputBackings and IOSurface-backed MultiArray?

Yes this will work as long as the output backing buffer is the correct size corresponding to the size of the input.

One note is that being able to avoid data copies during inference for models with flexible shapes will vary depending on circumstances. You can use the Core ML Instrument and look in the Data lane to see if data copies are occurring.

Question page
My app iterates over the user's entire photo library using VNDetectHumanRectanglesRequest and VNRecognizeAnimalsRequest, in order to find all the photos containing humans and pets. For performance reasons, I'm only loading a small version of the photo. I've noticed that this (obviously) affects the results. Is there a recommended image size when using these requests? I'd also appreciate any other ideas on how to optimize the performance for such a task.

There is no hard and fast size that works for everything. The reason is that it is limited by the ratio of the dog or human with respect to the image to be detected. So it depends on your use case whether you, for instance, want to find a small dog in the background of a large panorama.

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current Proto release is focused on functionality and we have not tuned the performance yet. Do look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Do update to latest and see if it still is giving bad utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the cpu? That hurts performance.
  4. Which OS is it: 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
Is it possible to dispatch a Core ML inference evaluation as part of a display or compute shader pipeline. Or do I need to wait for the CPU to be informed the frame has been rendered before dispatching from the CPU. Best of all would be if it could run on the ANE so that the GPU is free to work on the next frame.

If the output of GPU is in a IOSurface (in Float16 format), you can feed that to Core ML and let ANE work on it directly without any copies but, CPU does get triggered today for synchronizing these two computations.

Core ML doesn’t support MTLSharedEvent, if that’s what’s implied here.

Would you be able to file a feature request on feedbackassistant.apple.com with a few more details about your use case and maybe some sample code showing how you want to accomplish this? That would really help us push the API in the right direction.

Question page
Hi, I'm excited to see more information about optimizing recent models for Core ML including the `ane_transformers` repo. If I wanted to optimize eg CLIP for ANE, should I use code from that repo, or just try to take recommendations from the case study?

Yes I think using the code from the Apple Neural Engine (ANE) Transformers repo at the end of the Deploying Transformers on the Apple Neural Engine article is the best way to get started, and definitely follow along the recommendations of the article as well. Sounds like you are already on the right track!

The default conversion should be quite efficient as well for the neural engine (NE). With the new performance tab in Xcode 14, you will see whether the model is already neural engine resident or not.

There are details in the article on some of the changes specific to distilbert, which may or may not be required for the transformer architecture in CLIP.

There are more details to be found in the A Multi-Task Neural Architecture for On-Device Scene Analysis article as well, hope this helps.

In any case if you find any inefficiencies after conversion, feel free to share with us via a feedback request. We are constantly adding new converter and NE compiler optimizations to automatically detect patterns and map them efficiently to NE, so such feedback is very valuable!

Question page
In the "Optimize your Core ML usage" session, the presenter, Ben, explains that he got a latency of 22ms using the new performance metrics and that gives him a running frame rate of 45 frames per second. How did he come to that conclusion and how can I look at my performance metrics to determine our frames per second as well?

The number is just an upper bound estimate based on:

1000 ms / 22 ms ≈ 45 predictions per second

Such estimates often help us to understand the amount of headroom we can use for other operations while meeting the real time requirement (30 fps, etc).

Question page
Performance
When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?

The window size is fixed once a model is trained. At prediction time, you may adjust something called stride, i.e., how often (in terms of number of frames) you create a new prediction window. The smaller the stride, the sooner you make the next prediction, which makes the results refresh faster.

There was sample code from last year, called Detecting Human Actions in a Live Video Feed, which you may check as a reference.

Your app - and the model - need to balance three concerns: Accuracy, Latency, and CPU Usage.

  • A shorter window size (selected at training time) may reduce run-time latency! But if the window is shorter than the longest event you are trying to recognise, then accuracy may be reduced
  • A shorter stride can also reduce latency, but results in more frequent calls to the model, which can increase energy use.
  • A longer stride reduces CPU usage, and allows other functions in your app such as drawing functions more time to work - which might help reduce any performance related bugs in your app, but will increase latency.

Also, try not to skip frames (unless you did the same for your training data); otherwise, the effective action speed captured in the window will change, and this may hurt the prediction accuracy.

Latency: the time between when the action happens in the real world and when the app detects it and tells the user. This will be experienced as 'snappiness'. For action classification, latency is at least as long as the window size, plus a bit. It's an emergent parameter caused by other choices you make.

Stride: the time the model waits in between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.

Latency can be slightly shorter than the window size, because the model can predict the action when it sees a partial action. That's why a shorter stride can help refresh the results faster.
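
To make the window/stride relationship concrete, here is a generic sketch (not an Apple API) of only running the classifier every stride frames once a full window of poses is available; the numbers and the predict closure are placeholders.

struct WindowedPredictor {
    let windowSize = 60                  // frames, fixed at training time
    let stride = 15                      // frames between predictions, tunable at runtime
    private var frames: [[Float]] = []   // one flattened pose per frame
    private var framesSinceLastPrediction = 0

    mutating func add(pose: [Float], predict: ([[Float]]) -> String) {
        frames.append(pose)
        if frames.count > windowSize { frames.removeFirst() }
        framesSinceLastPrediction += 1

        // Only invoke the model every `stride` frames, and only with a full window.
        if frames.count == windowSize && framesSinceLastPrediction >= stride {
            framesSinceLastPrediction = 0
            print("Predicted action:", predict(frames))
        }
    }
}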

You may also be able to reduce energy, memory and CPU usage by setting the camera to a lower resolution. If your app is very busy, making this change might improve app performance. This change should not reduce model accuracy, but your user's view of the scene will not be so sharp.

Another inference speed factor might be your device. A newer device with neural engine will help. A fully charged device, and operating in good lighting condition, etc. can also help to reduce the latency (improve the Vision pose detection speed, and camera frame rate)

We have a great session on understanding the performance of apps with machine learning models this year! Check it out:

Optimize your Core ML usage

Question page
How can I ball park the runtime for linear regressor predictions? What I really want to know, how often I can call this?

I love this question, and we have a great session for you which dropped today.

Optimize your Core ML usage

Question page
Do you know if it is possible to have a layer-wise execution time profiling with XCode 14 for the operations that run on the Neural Engine or GPU?

The new performance reports give a layer-wise break down of compute unit support, but not execution time. Please consider filing some feedback at:

http://feedbackassistant.apple.com

Question page
Core ML Performance Report is great, but I can't find per-layer performance stats to find bottlenecks in our model.

The new performance reports offer per-layer compute unit support, but not per-layer timings. One further step you can take to find bottlenecks in the model is to press the "Open in Instruments" button where you can see further details in the Core ML Instrument. This won't offer per layer timing details, but it can help find bottlenecks related to data operations and compute unit changes.

Question page
Is there a way to run performance report on older versions of iOS? I suspect new compiler runs model differently than the Xcode 13 one.

Performance reports require iOS 16. Older iOS versions unfortunately cannot provide the same information.

Question page
Is there a way to use one IOSurface for both ANE and GPU work? Or access ANE IOSurface directly, and map it to MTLTexture by hand?

IOSurface-backed CVPixelBuffer with OneComponent16Half pixel format type can be shared with Neural Engine without copy. Likewise, MLMultiArray which is backed by the pixel buffer can also be shared without copy. (See MLMultiArray.init(pixelBuffer:shape:)).

For input features, using these objects in MLFeatureValue is enough to take advantage of the efficient data processing. When the output feature type matches the type above, Core ML automatically uses these objects in the output feature values as well.

For output features, you can even request the Neural Engine to write into your own buffer directly. See MLPredictionOptions.outputBackings.

You can create MTLTexture view into the pixel buffer with CVMetalTextureCache.

Question page
Is there any chance you will add support for at least 4 channels no-copy buffers?

One way is to stack each channel on height axis and use that big pixel buffer as a backing of MLMultiArray. So:

// size = (width, height * 4)
// format = kCVPixelFormatType_OneComponent16Half
let backingPixelBuffer = ...
let multiArray = MLMultiArray(pixelBuffer: backingPixelBuffer, 
                              shape: [4, height, width])

This won’t work if your image representation uses a so-called “packed” format, where the channels are interleaved in the frame buffer. If that is what you are looking for, we would appreciate a Feedback Assistant report with your use case.

Question page
Are float16 MLMultiArray no-copy also? Will there be a copy event if I specify user-allocated MLMultiArray with float16 data into outputBackings?

The Neural Engine is unable to write into MLMultiArray directly unless it was initialized with an IOSurface-backed pixel buffer.

MLMultiArray.init(pixelBuffer:shape:)

Question page
Is direct writing to IOSurface-backed buffers compatible with flexible shape input/outputs?

As the application needs to prepare the output buffer for the outputBackings property, it must know the shape of the output before invoking the inference. Some flexible shape models (e.g. enumerated input shapes) meet this criterion, but so-called “dynamic shapes”, where the output shape is dynamically determined by the network, won’t work.

Question page
Are there any limitations on number of IOSurface-backed buffers in a model?

No, Core ML doesn’t impose any limits.

Question page
Is it possible to have flexible shape (enumerated) inputs (and therefore outputs) to be compatible with outputBackings and IOSurface-backed MultiArray?

Yes this will work as long as the output backing buffer is the correct size corresponding to the size of the input.

One note is that being able to avoid data copies during inference for models with flexible shapes will vary depending on circumstances. You can use the Core ML Instrument and look in the Data lane to see if data copies are occurring.

Question page
How do I handle situations where older ANE versions might not support certain layers and it will result in cpuAndNeuralEngine config being extremely slower on some devices?

MLComputeUnits.all is the default option and we recommend using it in most cases.

Core ML tries to optimize for latency while utilizing all the available compute units. MLComputeUnits.cpuAndNeuralEngine is helpful when your app is using the GPU for pre- or post-processing and would like Core ML not to dispatch the model on the GPU. Other than that, MLComputeUnits.cpuAndNeuralEngine behaves very similarly to MLComputeUnits.all.

If you have a model that is running much slower on certain devices, we recommend filing some feedback at http://feedbackassistant.apple.com with the model and specific device(s) included.

Question page
My app iterates over the user's entire photo library using VNDetectHumanRectanglesRequest and VNRecognizeAnimalsRequest, in order to find all the photos containing humans and pets. For performance reasons, I'm only loading a small version of the photo. I've noticed that this (obviously) affects the results. Is there a recommended image size when using these requests? I'd also appreciate any other ideas on how to optimize the performance for such a task.

There is no hard and fast size that works for everything. The reason is that it is limited by the ratio of the dog or human with respect to the image to be detected. So it depends on your use case whether you, for instance, want to find a small dog in the background of a large panorama.

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current Proto release is focused on functionality and we have not tuned the performance yet. Do look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Do update to latest and see if it still is giving bad utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the cpu? That hurts performance.
  4. Which OS is it: 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
Is it possible to dispatch a Core ML inference evaluation as part of a display or compute shader pipeline. Or do I need to wait for the CPU to be informed the frame has been rendered before dispatching from the CPU. Best of all would be if it could run on the ANE so that the GPU is free to work on the next frame.

If the output of GPU is in a IOSurface (in Float16 format), you can feed that to Core ML and let ANE work on it directly without any copies but, CPU does get triggered today for synchronizing these two computations.

Core ML doesn’t support MTLSharedEvent, if that’s what’s implied here.

Would you be able to file a feature request on feedbackassistant.apple.com with a few more details about your use case and maybe some sample code showing how you want to accomplish this? That would really help us push the API in the right direction.

Question page
Hi, I'm excited to see more information about optimizing recent models for Core ML including the `ane_transformers` repo. If I wanted to optimize eg CLIP for ANE, should I use code from that repo, or just try to take recommendations from the case study?

Yes I think using the code from the Apple Neural Engine (ANE) Transformers repo at the end of the Deploying Transformers on the Apple Neural Engine article is the best way to get started, and definitely follow along the recommendations of the article as well. Sounds like you are already on the right track!

The default conversion should be quite efficient as well for the neural engine (NE). With the new performance tab in Xcode 14, you will see whether the model is already neural engine resident or not.

There are details in the article on some of the changes specific to distilbert, which may or may not be required for the transformer architecture in CLIP.

There are more details to be found in the A Multi-Task Neural Architecture for On-Device Scene Analysis article as well, hope this helps.

In any case if you find any inefficiencies after conversion, feel free to share with us via a feedback request. We are constantly adding new converter and NE compiler optimizations to automatically detect patterns and map them efficiently to NE, so such feedback is very valuable!

Question page
In the "Optimize your Core ML usage" session, the presenter, Ben, explains that he got a latency of 22ms using the new performance metrics and that gives him a running frame rate of 45 frames per second. How did he come to that conclusion and how can I look at my performance metrics to determine our frames per second as well?

The number is just an upper bound estimate based on:

1000 ms / 22 ms ≈ 45 predictions per second

Such estimates often help us to understand the amount of headroom we can use for other operations while meeting the real time requirement (30 fps, etc).

Question page
Does Core ML benefit from two ANE's in M1 Ultra?

Yes, when you are using multiple models or batch inference.

Question page
Playground
How can I fix a “can’t find MLLinearRegressor in scope” error?

MLLinearRegressor is a symbol from CreateML, do:

import CreateML

instead of:

import CoreML

The quickest way is to try that in an Xcode Playground

Note: CreateML is NOT available in the iOS simulator, so when you build an app for iOS, please target a physical iOS device if you want to build & run. If you just want to build without a physical iOS device attached, choose any iOS device as the build destination.

If you want to try it in a Playground, please use a macOS Playground, since an iOS Playground uses the iOS simulator as well.
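
A minimal macOS Playground sketch of training a linear regressor; the CSV path and column name are placeholders.

import CreateML
import Foundation

let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/data.csv"))
let regressor = try MLLinearRegressor(trainingData: data, targetColumn: "price")
print(regressor.trainingMetrics)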

Question page
Pre-trained Models
Hello! You mention searching for models in various "specialized" websites and such... do you have favorite places you've gone to find models?

Besides searching Github, you can use these sites:

Question page
PyTorch
What image-sharpening torch model did you use in the "Optimize your Core ML usage" talk?

For the image sharpening model, we started with this Super Resolution PyTorch model and then made some custom modifications to fit our use case before converting to Core ML format using coremltools.

Question page
I’ve tried to run a resnet50 on PyTorch MPS backend, while running Mac Pro with 6900XT, and achieved 23% utilization, while 3090 was running 10 times as fast on the same code. Do you have ideas on why is this happening, and how to further optimize things on Radeon GPU’s?

Our current prototype release is focused on functionality and we have not tuned the performance yet. Look out for performance improvements in the PyTorch nightly builds in the upcoming months.

For this particular case, we would like to know:

  1. What’s the current PyTorch nightly you are using? Update to the latest and see if it still gives low utilization.
  2. Can you share the network code?
  3. Are there any operations falling back to the CPU? That hurts performance.
  4. Which OS is it - 12.3/12.4 or Ventura?

Do file an issue on PyTorch on GitHub and tag it with "module:mps". Also send it to us through FeedbackAssistant.

Question page
Is it possible to convert a PyTorch Text->Image model, such a VQGAN, to CoreML?

You can try using coremltools - https://coremltools.readme.io/docs/pytorch-conversion

I was trying to find something more specific for you, but couldn’t. Personally I haven’t tried VQGAN, but it seems like the CLIP model can be converted. Here’s a resolved issue regarding CLIP.

Question page
While using MPS backend in PyTorch, I've found out that there is no way to select a GPU. This feature would be really beneficial while running Mac Pro with multiple GPU's.

We currently don’t have multi-GPU support.

Question page
Does coremltools support conversion of PyTorch Text-> Image models like CLIP? VQGAN?

Both these models (CLIP and VQGAN) are based on CNNs and transformer architectures, both of which should be supported.

In fact here is a resolved issue of a CLIP model conversion.

Note that, depending on the details, you may have to perform the pre-processing that transforms the text input into a tensor representation outside of the PyTorch model you pass to the Core ML Tools convert API. The conversion operates on PyTorch models with a tensor-in, tensor-out interface.

I’d say just give the converter a try, and please take a look at some of the examples on the doc page and if you run into issues, post on the Github repo.

Question page
Is there custom operation support for PyTorch?

To learn more about GPU acceleration for PyTorch and TensorFlow, please refer to Accelerate machine learning with Metal

Specifically, PyTorch is open sourced, so you can leverage this to implement custom operations in Metal. Custom ops are also supported for TensorFlow as outlined in the session.

Question page
Recommender System
Hi, can an ML model extract certain values of a json such as “VideoType” and URLs then return those values to make a network request? I’m looking at making a video recommendation system with ML but not sure the best way to do it.

Have you tried the MLRecommender in Create ML? That is a great place to start for building recommender systems. Since your data is in JSON format, you can also try the TabularData framework, which can help with the JSON loading!
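
As a rough sketch of that combination, assuming your JSON has user, item and rating fields (the file path and column names below are hypothetical):

import CreateML
import TabularData
import Foundation

let url = URL(fileURLWithPath: "/path/to/ratings.json")

// TabularData is handy for exploring and cleaning the JSON first.
let frame = try DataFrame(contentsOfJSONFile: url)
print(frame.description)

// MLDataTable can also load the JSON directly for training the recommender.
let table = try MLDataTable(contentsOf: url)
let recommender = try MLRecommender(trainingData: table,
                                    userColumn: "userID",
                                    itemColumn: "videoID",
                                    ratingColumn: "rating")
try recommender.write(to: URL(fileURLWithPath: "/path/to/VideoRecommender.mlmodel"))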

Question page
Rectangle Detection
Vision question: does a VNRectangleObservation contain information about the shape's full outline? For example: a document scanner that needs to fully extract a page from its background. VNDetectRectanglesRequest will provide the position of each corner of the page, which allows us to clip the shape assuming the page is flat. But if the paper is curled, we can end up with bits of background in our cropped image. Is there a way to trace an accurate outline for imperfect rectangles?

For imperfect documents, you might want to look at VNDetectDocumentSegmentationRequest and they combine it with a contour detection on the globalSegmentationMask that you get as a result.

https://developer.apple.com/documentation/vision/vndetectedobjectobservation/3798796-globalsegmentationmask

It is a low-res pixel buffer that represents the shape of the detected document, where each pixel represents the confidence of being or not being part of the document.

Document segmentation is ML based and trained on all kinds of documents, labels, papers, etc. Rectangle detection is a traditional algorithm that works on edges that intersect to form a quad.
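
A minimal sketch of the segmentation-plus-mask part might look like this (assuming you already have the captured image as a CGImage):

import Vision

// Segment a document, then read the low-resolution confidence mask.
func documentMask(in cgImage: CGImage) throws -> CVPixelBuffer? {
    let request = VNDetectDocumentSegmentationRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    guard let observation = request.results?.first else { return nil }
    // Each pixel in the mask is a confidence of belonging to the document.
    // You could run VNDetectContoursRequest on this buffer to trace the outline.
    return observation.globalSegmentationMask?.pixelBuffer
}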

Question page
Regressor
Can someone clarify the difference between a classifier and regressor again for me?

A classifier is used to predict a discrete categorical value. Think of it as predicting an enum.

A regressor is used to predict a real value. Think of it as predicting a float or double.

In the first demo in Get to know Create ML Components a regressor was used to predict a ripeness value.

A classification approach to the same problem would require you to define categories like “green”, “not ripe”, “ripe”, “over ripe”. This is an option as well, but you would not be able to compare the ripeness of two examples that got classified into the same category.

How about runtime efficiency between the above-mentioned classifier and regressor?

They should have similar prediction compute time if you are using a common feature extractor. For image models that feature extraction step will likely dominate inference compute. LogisticRegressionClassifier and LinearRegressor will both be doing a similarly sized matrix multiplication behind the scenes, particularly if you restrict yourself to a few classes. As the number of classes increases, the classifier will become slower.

Question page
Sample Code
In the "What's new in Create ML" talk, near the end Repetition Counting was mentioned, and a reference to the "linked article and sample code", yet the WWDC22 Sample Code list does not include this, nor does the documentation, I believe. Can you point me to the Sample Repetition Count code and Documentation?
When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?

The window size is fixed once a model is trained. At prediction time, you may adjust something called stride, i.e., how often (in terms of number of frames) you create a new prediction window. The smaller the stride, the sooner you make the next prediction, which makes the results refresh faster.

There was sample code from last year called Detecting Human Actions in a Live Video Feed; you may check this as a reference.

Your app - and the model - need to balance three concerns: Accuracy, Latency, and CPU Usage.

  • A shorter window size (selected at training time) may reduce run-time latency! But if the window is shorter than the longest event you are trying to recognise, then accuracy may be reduced.
  • A shorter stride can also reduce latency, but results in more frequent calls to the model, which can increase energy use.
  • A longer stride reduces CPU usage and allows other functions in your app, such as drawing, more time to work - which might help reduce any performance-related bugs in your app, but will increase latency.

Also, try not to skip frames (unless you did the same for your training data); otherwise, the effective action speed captured in the window will change, and this may hurt the prediction accuracy.

Latency: is the time between when the action happens in the real world and when the app detects it and tells the user. This will be experienced as 'snappiness'. For action classification, latency is at least as long as the window size, plus a bit. It's an emergent parameter caused by other choices you make.

Stride: is the time the model waits in between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.

Latency can be slightly shorter than the window size, because the model can predict the action when it sees a partial action. That’s why a shorter stride can help to refresh the results faster.

You may also be able to reduce energy, memory and CPU usage by setting the camera to a lower resolution. If your app is very busy, making this change might improve app performance. This change should not reduce model accuracy, but your user's view of the scene will not be so sharp.

Another inference speed factor might be your device. A newer device with a Neural Engine will help. A fully charged device operating in good lighting conditions can also help to reduce the latency (improving the Vision pose detection speed and camera frame rate).
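
To make the window/stride relationship concrete, here is an illustrative sketch of a sliding-window inference loop; the pose representation and the classify closure are stand-ins for your own model call:

final class SlidingWindowPredictor {
    private var window: [[Float]] = []      // one pose (flattened keypoints) per frame
    private let windowSize = 90             // fixed at training time
    private let stride = 15                 // tunable at inference time
    private var framesSinceLastPrediction = 0

    func add(pose: [Float], classify: ([[Float]]) -> String) -> String? {
        window.append(pose)
        if window.count > windowSize { window.removeFirst() }
        framesSinceLastPrediction += 1

        // Only predict once we have a full window, and only every `stride` frames.
        guard window.count == windowSize,
              framesSinceLastPrediction >= stride else { return nil }
        framesSinceLastPrediction = 0
        return classify(window)             // e.g. run the Core ML model here
    }
}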

We have a great session on understanding the performance of apps with machine learning models this year! Check it out:

Optimize your Core ML usage

Question page
Is the sample code for human action repetition counting available?
I tried the sample code on 6th gen. iPad Mini doing 20 jumper jacks at VARYING paces. I found delays and missing counts. The first few jumper jacks were always not being counted. I’m guessing the hard coded stride 5 and length 90 used for sliding window transformer may be the culprit. To me there isn’t a correct set of numbers to use because I have no control on how fast or slow my users will do his or her jumper jacks. Please advise.

A few things you may further check:

  • frame rate - the sample app should print out the frame rate as debug information in the Xcode console. Please check if it is roughly 30fps. If not, try to improve the environment lighting, charge the device, etc. and see if it improves.
  • body pose - please check if a single person’s full body pose is in the middle of the screen, and while the person moves, check if the poses are accurate (e.g., no missing joints or joints jumping everywhere, no visible delays of pose tracking, etc.)
  • ignored joints in JointsSelector - the initial setting here has 5 joints ignored (to demonstrate the purpose of this transformer). You may remove them if you are interested in the full body and all joints.
  • stride - determines how often (in terms of frame count) the counter is refreshed, and it can be set to other numbers. The length of 90, however, should not be changed; it is fixed for the model.
  • Downsampler transformer has a factor of 1 - which works best for actions close to ~1s per repetition. It can tolerate varying speed to some extent. If your targeted action is typically much slower, you may set the factor to 2 or another number, though this may increase the counter delay too. Unfortunately, you have to manually set it in the sample app for now.
  • You are also free to change some of the other logics in the sample app, such as how uiCount is rounded and reset, etc.
Question page
Correct me if I’m wrong. 1. Virtual HIIT fitness coach is not a good app idea for today. Action classifier can’t classify actions fast and accurate enough on mobile devices today? 2. The model was trained using 30 fps and a prediction windows size of 90 frames under the assumption that each human body action lasts about 3 seconds?
  1. Action classifier is a model template that needs to be trained. So it is good with fitness actions that it was trained with, such as jumping jacks, squats, and some HIIT actions. Depending on your specific needs, we may further talk about how fast the actions could be, how accurate, etc. Some resources are also here:
  2. This model is a separate model for counting actions (not the action classifier from WWDC20). It is class-agnostic, exposed via our API, and does not need to be trained. It was trained with 30fps videos, and the window size is 90 frames. But this is a completely different model; the window size of 90 isn’t the same concept as the action classifier’s window size. Within these 90 frames, multiple completed actions are OK (e.g., best with 2~4 action repetitions captured within the window). If you have videos or a camera feed other than 30fps, you could choose to downsample using the Downsampler transformer. If your targeted actions are at 30fps but quite slow, such as push-ups, you could choose to downsample too.
Question page
In "Accelerate machine learning with Metal" Drhuva referenced a new sample code for NeRFs at 14:08. But I can't find it anywhere :( P.S. Yaaay, NeRFs! :)
Will the source code for the project in "Explore the machine learning development experience" be available? It would be helpful to be able to dig in and change some things up to really understand the flow.

The code is not available at this time. Your interest and request are appreciated and noted 🙂

Explore the machine learning development experience

Question page
Semantic Segmentation
We can separate people and background on photo, for example to create stickers, using VNGeneratePersonSegmentationRequest. What about animals, objects and etc like you did it in iOS 16? I mean feature that I can long-press at any object on photo to copy/paste it. Do we have ready API for that?

Yes, you can use VNGeneratePersonSegmentationRequest for people. There currently is no equivalent for animals or other objects via the Vision APIs.

Please consider filing a feature request or feedback via https://feedbackassistant.apple.com or bringing more questions to the Q&A Vision digital lounge from 2-3pm on Thursday
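
For the people case, a minimal sketch with VNGeneratePersonSegmentationRequest might look like this:

import Vision
import CoreVideo

// Generate a person segmentation mask for a single image.
func personMask(for cgImage: CGImage) throws -> CVPixelBuffer? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate                      // use .fast or .balanced for video
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])
    return request.results?.first?.pixelBuffer            // grayscale confidence mask
}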

Question page
Is there any guidance on using Create ML for creating segmentation models (great for use in ARKit to segment unique types of objects as detected in the camera feed)? Or would this be a case for building a custom Turi/other model and converting to Core ML?

The first thing I would try is the DeepLabV3 model! Which can be found here:

https://developer.apple.com/machine-learning/models

You can also convert your custom model using coremltools

Question page
What's the best way to create a "silhouette" video as opposed to a silhouette photo? Would Optical flow be best for this or sampling every frame for a silhouette or...

It depends! The two key considerations are:

  • how expensive is it to generate a silhouette a priori for every frame? If that’s cheap enough, it might be simpler and better to do that;
  • on the other hand, optical flow can help in frame-to-frame stability.

It’s really going to depend on surrounding context and performance requirements (both latency and accuracy).

Question page
Last year you introduced the VNGeneratePersonSegmentationRequest. I know you can't comment on future plans, but it would be amazing if the new pet / object segmentation of iOS 16 was available to developers

It is always good to file feedback and explain what you are looking for.

Question page
Simulator
How can I fix a “can’t find MLLinearRegressor in scope” error?

MLLinearRegressor is a symbol from CreateML, do:

import CreateML

instead of:

import CoreML

The quickest way is to try that in an Xcode Playground

Note: CreateML is NOT available in the iOS simulator, so when you build an app for iOS, please target a physical iOS device if you want to build & run. If you just want to build without worrying about a physical iOS device, choose any iOS device as the build target.

If you want to try it in a Playground, please use a macOS Playground, since an iOS Playground uses the iOS simulator as well.

Question page
Sparse Weights
I've noticed you have a new way to store model weights in sparse form (which is a great addition) and am wondering if there's some fundamental blocker to using sparse operations at inference time too?

The sparsity is leveraged during execution on certain compute units such as the Neural Engine.

This does not happen automatically for any model that happens to contain sparse matrices; it is best to use the sparse encoding explicitly.

Question page
Do you have some method to share with us to benefit from sparse weight features (very nice features) without sacrificing the applicative performances?

The sparsity weight feature will be useful for models that have been trained with “weight pruning” techniques

Question page
Style Transfer
Do you have an example of how the ML image style transfer was created from an earlier session?

You can learn about how Create ML can help you build style transfer models with an example integration in this session:

Build Image and Video Style Transfer models in Create ML

There are also a wide variety of style transfer models online that can be converted to Core ML format with coremltools.

The model takes in an image and outputs an image. I believe the app is streaming data from an AVCaptureSession and running each frame through the Core ML model and outputting the result properly scaled back to the original image size.
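
As a rough sketch of that per-frame flow (the style transfer model itself is a placeholder for whichever image-to-image Core ML model you use):

import Vision
import CoreML

final class StyleTransferRenderer {
    private let request: VNCoreMLRequest

    init(styleTransferModel: MLModel) throws {
        let vnModel = try VNCoreMLModel(for: styleTransferModel)
        request = VNCoreMLRequest(model: vnModel)
        request.imageCropAndScaleOption = .scaleFill
    }

    // Call from captureOutput(_:didOutput:from:) with each camera frame.
    func stylize(_ pixelBuffer: CVPixelBuffer) throws -> CVPixelBuffer? {
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try handler.perform([request])
        // Image-to-image models come back as pixel buffer observations;
        // scale the result back to the original frame size for display.
        return (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer
    }
}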

Question page
SwiftUI
There was a great WWDC video back in 2018 titled Vision with Core ML. The example app used live video capture to feed an image through the model using scene stability. There is sample code out there for UIKit, but always wanted to try and re-make using SwiftUI as a starting point. Any tips or pointers on making an image recognition app with live video capture using SwiftUI as a starting point ?

You have a couple choices here. You can use the techniques you already know, and use UIViewRepresentable to import the UIViews into your app. That will still work really well!

I've actually done exactly that using UIViewControllerRepresentable in the app I'm working on, and used a modified version of the VisionViewController which I think came from that video's sample code. It works perfectly... the entire app is SwiftUI except for the VisionViewController part.

Here's a link to the original sample code that has the VisionViewController (click the download button from that page): Training a Create ML Model to Classify Flowers. Unfortunately I don't have a publicly available version that shows where I've used that with UIViewControllerRepresentable.

Alternatively, you can display the camera feed by sending the image from each frame into a SwiftUI Image. The new Create ML Components feature actually includes a VideoReader.readCamera method which provides an asynchronous stream of image buffers, which is a great way to get started with this approach. Alternatively you can use your existing AVCaptureDevice logic and delegate methods to provide a series of images. You can see an example of this approach in the rep counting demo app which will be available soon as part of this year's WWDC session What's new in Create ML

Recently I refactored some UIKit code into SwiftUI and found that the ViewController could be relatively easily transformed into an ObservableObject class, changing @IBOutlets into @Published properties.

I've been experimenting with using AVCaptureVideoPreviewLayer too. Using a preview layer is likely to be a very CPU / memory efficient way to do it. I don't yet have a clear picture of the efficiency of using SwiftUI views to run the preview directly in the way I proposed, so while it can be a lot of fun to do it purely in SwiftUI it may not be the best choice just yet.
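
For completeness, a minimal UIViewControllerRepresentable wrapper might look like the following; the VisionViewController body here is just a stub standing in for the sample's camera and Vision code:

import SwiftUI
import UIKit

// Stand-in for the UIKit view controller from the sample project above;
// your AVCaptureSession + VNCoreMLRequest handling lives inside it.
final class VisionViewController: UIViewController {
    // Camera setup and Vision request handling go here.
}

// Minimal SwiftUI wrapper sketch.
struct VisionView: UIViewControllerRepresentable {
    func makeUIViewController(context: Context) -> VisionViewController {
        VisionViewController()
    }

    func updateUIViewController(_ uiViewController: VisionViewController, context: Context) {
        // Push SwiftUI state changes into the view controller here if needed.
    }
}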

Question page
Live Text seems to be added to UIImageView via `.addInteraction(<ImageAnalysisInteraction>)`. Is there a way to add this interaction to SwiftUI's `Image`?

I don’t believe so. You can probably wrap a UIImageView in a UIViewRepresentable - but of course it wouldn’t be a SwiftUI Image anymore. I think we might have sample code that does something similar in the State of the Union donut app.

Here's the relevant part in the Platforms State of the Union video.

But yeah, we definitely need SwiftUI support for the new VisionKit APIs 🙂 feedbacks may help
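
Until then, a hedged sketch of wrapping UIImageView plus ImageAnalysisInteraction for SwiftUI could look roughly like this (iOS 16 VisionKit APIs; treat the exact flow as a starting point, not an official pattern):

import SwiftUI
import UIKit
import VisionKit

struct LiveTextImageView: UIViewRepresentable {
    let image: UIImage
    private let interaction = ImageAnalysisInteraction()
    private let analyzer = ImageAnalyzer()

    func makeUIView(context: Context) -> UIImageView {
        let imageView = UIImageView(image: image)
        imageView.contentMode = .scaleAspectFit
        imageView.isUserInteractionEnabled = true
        imageView.addInteraction(interaction)
        return imageView
    }

    func updateUIView(_ uiView: UIImageView, context: Context) {
        Task {
            // Analyze for text and machine-readable codes, then hand the
            // result to the interaction so Live Text becomes tappable.
            let configuration = ImageAnalyzer.Configuration([.text, .machineReadableCode])
            if let analysis = try? await analyzer.analyze(image, configuration: configuration) {
                interaction.analysis = analysis
                interaction.preferredInteractionTypes = .automatic
            }
        }
    }
}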

Question page
Tabular Classifier
Can I train a tabular classifier model on iOS?

You absolutely can! Have a look at last year's session on building dynamic apps:

Build dynamic iOS apps with the Create ML framework (WWDC 2021)

Also, be sure to checkout the Get to know Create ML Components session that dropped today. There, Alejandro walks through building a tabular regressor all in Swift.

Question page
Can the number of features vary? For example, say I want to predict a cat's age given the number of whiskers, fur density, and number of legs. However, sometimes I may only have number of whiskers and number of legs, but not fur density. Would that require its own, separately trained, MLModel?

This is such a cool example! Thanks for the question. Have you tried using the tabular classifiers in Create ML? When you have missing data in your feature columns you can try replacing them (imputing). The TabularData framework makes this part really easy.

Question page
Tabular Data
Can you point me to information on the TabularData framework?
Can the number of features vary? For example, say I want to predict a cat's age given the number of whiskers, fur density, and number of legs. However, sometimes I may only have number of whiskers and number of legs, but not fur density. Would that require its own, separately trained, MLModel?

This is such a cool example! Thanks for the question. Have you tried using the tabular classifiers in Create ML? When you have missing data in your feature columns you can try replacing them (imputing). The TabularData framework makes this part really easy.

Question page
Tabular Regressor
I would like to build an app to predict stock price trends and classify expenses in a bookkeeping / tax management app. How might I get started with Create ML?

So for something like that, you might look into Tabular Regressors for predicting future prices based on past price data (and whatever other data you want to use), and Tabular Classifiers to classify expenses into categories you select.

You might be particularly interested in training on-device using the Create ML framework for this... given the historical data would be highly personal and continually changing (if I'm understanding the use case). The Tabular Regressor example that Alejandro shows in the Get to know Create ML Components session is almost spot on with the problem you're trying to solve.

If the textual input is more varied and similar to human language then you might use a Text Classifier. If it's more like terms and you want to include additional context such as price (numbers), then the Tabular Regressor is more suitable.

Question page
TensorFlow
Are there any updates planned for coremltools, notably supporting complex numbers. I'm looking to convert a TensorFlow model, but even after implementing missing operations like Fast Fourier Transform, I'm blocked by the fact there is no complex number support.

Yes, you are right. There isn’t a complex number data type in the CoreML MIL spec.

The best way to use them is by treating complex numbers as 2D vectors of real and imaginary numbers.

Question page
We converted a Tensorflow image segmentation model to Core ML. We notice that we get different results when running this Core ML model on macOS (with Python3 and coremltools) versus on iOS. Predictions are way less accurate on iOS and we cannot explain why (even when setting computeUnits parameter to .cpuOnly).

Have you tried setting

compute_precision=ct.precision.FLOAT32

as described here?

That said, for the CPU compute unit I would expect the predictions to match between iOS and macOS. There could be differences with the “all” compute unit depending on the actual hardware the model runs on, which can differ between Mac and iOS. If you could file a feedback request with the code to reproduce these differences that you are observing, that would be great!

Question page
In "Accelerate machine learning with Metal" Drhuva referenced a new sample code for NeRFs at 14:08. But I can't find it anywhere :( P.S. Yaaay, NeRFs! :)
Is there custom operation support for PyTorch?

To learn more about GPU acceleration for PyTorch and TensorFlow, please refer to Accelerate machine learning with Metal

Specifically, PyTorch is open sourced, so you can leverage this to implement custom operations in Metal. Custom ops are also supported for TensorFlow as outlined in the session.

Question page
Text
This may be a more general Machine Learning question. I am interested in the ability to extract text from images and video. The Vision framework does a great job of extracting the text. One thing I would like to do is determine whether each piece of text is in a particular category of typeface, largely looking to tell a source code / monospace font from a sans serif font. Which machine learning technologies available on Apple platforms would be best suited for that? And a high level of how you might approach that?

So the Vision framework, where you extract the text, tells you the region in the image where the text is; the first thing to do would be to crop the text out of the image.

If you have a binary image classifier (sans serif vs serif, or “looks like source code” vs “doesn’t look like source code”, it’s worth experimenting with what definition works best – and you’d need to collect samples of each for your training set!), you can then throw that crop to this classifier to work out whether it’s source code or not.

So at a high level, what I’d do is:

  • train a binary classifier to distinguish source-code from not-source-code
  • using Vision, crop out the region of the image with detected text in
  • use your classifier to determine whether it’s source code or not

and go from there!

Also you can try out the Live Text API this year; it's able to extract text out of images and videos. However it does not provide font-related information of the text yet. You can file a bug with us to track this issue if needed.


From a non-Apple developer:

I created a serif/sans-serif model with CreateML. You can find it here: https://github.com/jmousseau/Mimeo

Question page
How would you recommend to approach the classification of various fine-grained subclasses of the same class? Specifically talking about different types of something made of paper. For example: "a postcard with something written on it", vs "an empty postcard" vs "just some piece of paper" vs "another object"? With a classifier model we were able to obtain very accurate results to distinguish "paper vs some other object". However we couldn't get accurate enough results (I think ~60% accuracy) regarding the more fine-grained decisions: "postcard vs some piece of paper" and "postcard with text vs empty postcard". The mistakes were usually into false-positive side (identifying some piece of paper as a postcard in my example). So how would you setup the training samples for this sort of goal? Or are we looking in the wrong option, and should be considering some other method, or a combination of methods instead?

Something you could try is doing a hierarchical approach where you first detect the overall class and then crop and do a more precise classification.

In addition to tuning the training, you can also tune the data you're training on, like you mention. Adding more examples (with breadth to encompass what you expect to see in practice) where you're getting the false positives might help iron out those edge cases. There are some data augmentation options in both the Create ML app and Create ML Components that could help grow your sample size and add some diversity to it without collecting monumentally more data.

Hard to say if it's feasible, but extracting the text definitely sounds better suited if your sub-classes are always text-based.

Question page
Does Core ML have everything necessary to perform keyword extraction? How would you go about extracting keywords from articles of text?

Natural Language has a number of tools that can be useful in keyword extraction: tokenization, part-of-speech tagging, named entity recognition, gazetteers that could be used to identify stop words, and so on.

We don't provide an implementation of a specific keyword or keyphrase extraction algorithm, but there are algorithms that are sometimes used that take into account features such as frequency, co-occurrence statistics, TF-IDF, etc. that can be calculated from text that has been tokenized and processed using some of these tools.

Doing this fully unsupervised is a difficult task, though. You might be able to do better if you have some advance knowledge of the vocabulary that is relevant to the sort of text you will be working with.
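
As a starting point, a crude frequency-based pass over nouns using NLTagger might look like this; real keyphrase extraction would layer scoring such as TF-IDF or co-occurrence statistics on top:

import NaturalLanguage

func topNouns(in text: String, limit: Int = 10) -> [(String, Int)] {
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text

    var counts: [String: Int] = [:]
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lexicalClass,
                         options: [.omitPunctuation, .omitWhitespace]) { tag, range in
        // Count nouns as rough keyword candidates.
        if tag == .noun {
            counts[text[range].lowercased(), default: 0] += 1
        }
        return true
    }
    return counts.sorted { $0.value > $1.value }.prefix(limit).map { ($0.key, $0.value) }
}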

Question page
What is your recommendation for using DataScannerViewController to detect money/currency values? DataScannerViewController.TextContentType does not appear to support money, currencies, or generic numbers (see FB10139138). The iOS Camera app supports money/currency detection in iOS 16. What is the best practice for me to implement a similar feature in my app? Should I recognize all text and then parse each recognized text item myself to determine if the string value contains number or currency amount?

Ah, good enhancement request. Currently we're not supporting currency.

You might be able to detect the presence of currency with UIDataDetectors, but you won't be able to highlight them.

Another option is to use capturePhoto() to take a still then use the Live Text APIs. That'll highlight all the data detector elements, not just money.

You could also add some business logic on top of Vision text recognition to dial in on numbers only, specific (relative) text size or even position of the text in the rectangle of the currency.
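
A sketch of that kind of business logic, filtering recognized strings down to ones that parse as a currency amount (the formatter rules and regex below are just illustrative):

import Vision
import Foundation

func currencyStrings(from observations: [VNRecognizedTextObservation],
                     locale: Locale = .current) -> [String] {
    let formatter = NumberFormatter()
    formatter.numberStyle = .currency
    formatter.locale = locale

    return observations
        .compactMap { $0.topCandidates(1).first?.string }
        .filter { candidate in
            // Keep strings that parse as a localized currency value,
            // or that look like a currency symbol followed by a number.
            formatter.number(from: candidate) != nil ||
            candidate.range(of: #"^\p{Sc}\s?\d+([.,]\d{2})?$"#, options: .regularExpression) != nil
        }
}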

Question page
V3 extends VNRecognizeTextRequest with automaticallyDetectsLanguage - If I turn this on, how do I discover what language it decided to use?

Vision will not tell you which languages have been detected. The intent of this is to allow the client to give a "hint" to the algorithm.

If you already know the language up-front, it's best to specify that language explicitly, which allows the framework to target that language for better accuracy. If you do not, it's better to set automaticallyDetectsLanguage to true, which essentially is communicating to the framework "I don't know which language" and the framework will do its best to decode any language.

You can use NLLanguageRecognizer to detect the dominant language after the text has been extracted by Vision.

Usually a sentence is sufficient to identify language. You can pass in as much as you like, but the algorithm limits the amount of text it will consider. Less than maybe 5-10 words is challenging.

If you have some prior information as to what the language might be, you can also pass hints and/or constraints to NLLanguageRecognizer.
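
For example, a minimal sketch of running NLLanguageRecognizer on the extracted text (the hint values are illustrative):

import NaturalLanguage

func dominantLanguage(of extractedText: String) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()

    // Optional: bias or restrict the guess if you have prior knowledge.
    recognizer.languageHints = [.english: 0.7, .french: 0.3]
    // recognizer.languageConstraints = [.english, .french]

    recognizer.processString(extractedText)
    return recognizer.dominantLanguage
}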

Question page
How different are VNRecognizedTextObservations (returned by VNRecognizeTextRequest) to the RecognizedItem array returned by DataScannerViewController? Do they have the same information in them? Also, is the DataScannerViewController using the same VNRecognizeTextRequest (with revision3) in the background to process the results?

RecognizedItem contains a lot of the same information, like transcript and corners (RecognizedItems are in view coordinates, however)... and RecognizedItem exposes the related Vision observation.

RecognizedItem, however, learns over time, so the longer we see a text group, the more accurate the transcript will be. The Vision observation that is exposed is really just based on the last frame processed.

I cannot state which revision it uses, if any (sorry to be vague), but DataScanner supports the same languages as VNRecognizeTextRequest.

Question page
Can VNRecognizeTextRequest be used to perform text recognition on images with handwritten content or should it only be used for typed text (or very close to typed)?

Yes it can to some degree. It won't read my bad handwriting for sure but others will work. Handwriting is of course so varied that it really depends on the person who writes it.

Question page
Text Classifier
I would like to build an app to predict stock price trends and classify expenses in a bookkeeping / tax management app. How might I get started with Create ML?

So for something like that, you might look into Tabular Regressors for predicting future prices based on past price data (and whatever other data you want to use), and Tabular Classifiers to classify expenses into categories you select.

You might be particularly interested in training on-device using the Create ML framework for this... given the historical data would be highly personal and continually changing (if I'm understanding the use case). The Tabular Regressor example that Alejandro shows in the Get to know Create ML Components session is almost spot on with the problem you're trying to solve.

If the textual input is more varied and similar to human language then you might use a Text Classifier. If it's more like terms and you want to include additional context such as price (numbers), then the Tabular Regressor is more suitable.

Question page
I'd like to classify larger article-sized bodies of text. One of the ways I'm working on doing this is by doing text classification. Given 5 categories of diary entry (eg family, health, spiritual, work, recreational), would it be preferred to use a single model that labels text with one of the 5? Or should I follow the SentimentClassifier example and use 5 separate models that each classify a string in 3 ways (notFamily, neutralFamily, isFamily)? If the latter, is this a use case for components?

Usually what I would tend to try first would be a single classifier model that labels text with one of your classes. That should be efficient and robust, and scale reasonably well with the number of classes. If you train multiple models, you would need to run each of them on each article you want to classify.

However, the questions being asked are subtly different in the two different cases. With a single model, you are asking which category or categories best match a given document. When you use separate models, you are asking whether a given document relates to a given category or not. The question you want to ask informs everything from your annotation to the type of model you use to the way you present your results.

Ultimately you will have to decide what analysis is best suited to your particular application.
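
For the single-model route, a macOS training sketch with MLTextClassifier might look like this (the file path and column names are placeholders):

import CreateML
import Foundation

let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/diary.csv"))
let (training, testing) = data.randomSplit(by: 0.8, seed: 7)

let classifier = try MLTextClassifier(trainingData: training,
                                      textColumn: "text",
                                      labelColumn: "category")
let metrics = classifier.evaluation(on: testing, textColumn: "text", labelColumn: "category")
print(metrics)

// A single prediction returns the best-matching category for one entry.
let label = try classifier.prediction(from: "Today I decided to ride my bike to the store.")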

Question page
Is it appropriate to try to use word embeddings to match long-form text up to single worded categories? For example, figuring out the distance between `"exercise"` and `"Today I decided to ride my bike to the store. I needed to get a workout in."` I'd like to match sentences and paragraphs up to to tags.

The most robust approach to this sort of categorization would be to pick a set of categories in advance, collect training data, and train a classifier to classify sentences according to these categories.

If you need to handle words outside the originally chosen set of categories, you could then use word embeddings to find an existing category similar to the entered word.

If you aren't able to train a model, things get a bit trickier. You can use tools such as part-of-speech tagging to identify relevant words in a sentence, e.g. nouns in the example you give, and determine how similar those are to the word you are trying to match. You would then need to figure out some way to take scores for individual words and form a score for an entire sentence.

Overall I think you would get better results by training a classifier, although it would require more work in advance for training.
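
As a small illustration of the embedding route, using the built-in English word embedding to relate a tag word to candidates (smaller distance means more related):

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    let distance = embedding.distance(between: "exercise", and: "workout")
    print("exercise <-> workout:", distance)

    // Nearest neighbors can suggest which existing category a new word matches.
    for (word, neighborDistance) in embedding.neighbors(for: "exercise", maximumCount: 5) {
        print(word, neighborDistance)
    }
}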

Question page
Text Generator
Is there the ability in Create ML to make text generation models, GPT-3 type applications? Haven’t seen it but wanted to double check

No. Create ML and Create ML Components are meant to allow you to create custom ML models fitted to your training data. If you want to use such a model, it makes sense to convert it to Core ML and try it out.

Question page
Training
Can I train a tabular classifier model on iOS?

You absolutely can! Have a look at last year's session on building dynamic apps:

Build dynamic iOS apps with the Create ML framework (WWDC 2021)

Also, be sure to checkout the Get to know Create ML Components session that dropped today. There, Alejandro walks through building a tabular regressor all in Swift.

Question page
When training a video action classifier model in Create ML, is it best to have only one person's poses in the frame (and crop out others)?

Yes.

If you have multiple people in the screen. Try to keep other people consistently smaller than the main person. Then it will still work (automatically select the maximum bounding box person)

Check out the video from WWDC 2020:

Build an Action Classifier with Create ML (at 24m21s)

When it comes to using the model in your applications, make sure to only select a single person. Your app may remind users to keep only one person in view when multiple people are detected, or you can implement your own selection logic to choose a person based on their size or location within the frame, and this can be achieved by using the coordinates from pose landmarks.

Question page
Turi Create
I utilize the drawing classifier from Turi Create in my app which has been working well. I previously tried an image classifier with Create ML but it took very significantly longer to train (about 55 hours vs 3.5 hours). Is Turi Create still the way to go until a drawing classifier gets added to Create ML? :D Bonus q: any Turi Create updates to share?

Create ML does not have a drawing classifier template or task API. You may want to check out the new Create ML Components framework which will let you construct your own pipeline similar to the one used in Turi Create.

Turi Create is still a good option if it's working for you. However, Turi Create is no longer under active development

Note: The updatable drawing classifier available on https://developer.apple.com/machine-learning/models/ has a pre-trained feature extractor for drawings as the first model in its pipeline. You could use this sub-model as a custom feature extractor.

Please consider filing feature requests or feedback on drawing classifiers via https://feedbackassistant.apple.com

Question page
Upscaling
The MetalFX team has presented a very nice (classical) method for video upscaling. What is the potential of using MPS to achieve machine learning upscaling?

MPSGraph supports most common neural-network machine learning layers and operations so you should be able to create an upscaling network from the basic components, but MPSGraph doesn't have prebuilt graphs or networks so you would need to investigate and research the network architecture yourself, train it (using MPSGraph or other training frameworks) and deploy on MPSGraph.

One benefit of using MPSGraph is that you can pretty easily incorporate other Metal kernels (for example MPS image processing kernels or your own kernels) and encode them to the same Metal CommandQueue (or MPSCommandBuffer) to achieve low-latency, often zero-copy execution between the pre/post-processing kernels and the MPSGraph segment(s).

Question page
Video
When training a video action classifier model in Create ML, is it best to have only one person's poses in the frame (and crop out others)?

Yes.

If you have multiple people in the screen. Try to keep other people consistently smaller than the main person. Then it will still work (automatically select the maximum bounding box person)

Check out the video from WWDC 2020:

Build an Action Classifier with Create ML (at 24m21s)

When it comes to using the model in your applications, make sure to only select a single person. Your app may remind users to keep only one person in view when multiple people are detected, or you can implement your own selection logic to choose a person based on their size or location within the frame, and this can be achieved by using the coordinates from pose landmarks.

Question page
In the "What's new in Create ML" talk, near the end Repetition Counting was mentioned, and a reference to the "linked article and sample code", yet the WWDC22 Sample Code list does not include this, nor does the documentation, I believe. Can you point me to the Sample Repetition Count code and Documentation?
When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?

The window size is fixed once a model is trained. At prediction time, you may adjust something called stride, i.e., how often (in terms of number of frames) you create a new prediction window. The smaller the stride, the sooner you make the next prediction, which makes the results refresh faster.

There was sample code from last year called Detecting Human Actions in a Live Video Feed; you may check this as a reference.

Your app - and the model - need to balance three concerns: Accuracy, Latency, and CPU Usage.

  • A shorter window size (selected at training time) may reduce run-time latency! But if the window is shorter than the longest event you are trying to recognise, then accuracy may be reduced.
  • A shorter stride can also reduce latency, but results in more frequent calls to the model, which can increase energy use.
  • A longer stride reduces CPU usage and allows other functions in your app, such as drawing, more time to work - which might help reduce any performance-related bugs in your app, but will increase latency.

Also, try not to skip frames (unless you did the same for your training data); otherwise, the effective action speed captured in the window will change, and this may hurt the prediction accuracy.

Latency: is the time between when the action happens in the real world and when the app detects it and tells the user. This will be experienced as 'snappiness'. For action classification, latency is at least as long as the window size, plus a bit. It's an emergent parameter caused by other choices you make.

Stride: is the time the model waits in between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.

Latency can be slightly shorter than the window size, because the model can predict the action when it sees a partial action. That’s why a shorter stride can help to refresh the results faster.

You may also be able to reduce energy, memory and CPU usage by setting the camera to a lower resolution. If your app is very busy, making this change might improve app performance. This change should not reduce model accuracy, but your user's view of the scene will not be so sharp.

Another inference speed factor might be your device. A newer device with a Neural Engine will help. A fully charged device operating in good lighting conditions can also help to reduce the latency (improving the Vision pose detection speed and camera frame rate).

We have a great session on understanding the performance of apps with machine learning models this year! Check it out:

Optimize your Core ML usage

Question page
May I ask if there is some functionality to enable to recognize the direction (arrow) of time in a video?

Interesting problem! You may find understanding the apparent motion of pixels in the image a useful input. You can compute optical flow using Vision VNGenerateOpticalFlowRequest
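
A minimal sketch of computing flow between two consecutive frames (my reading of which frame is "targeted" versus handled may need adjusting for your pipeline):

import Vision

// The result is a 2-channel pixel buffer of per-pixel (dx, dy) motion that
// you could analyze for the apparent direction of motion.
func opticalFlow(from previousFrame: CVPixelBuffer,
                 to currentFrame: CVPixelBuffer) throws -> CVPixelBuffer? {
    let request = VNGenerateOpticalFlowRequest(targetedCVPixelBuffer: previousFrame, options: [:])
    let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame, options: [:])
    try handler.perform([request])
    return (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer
}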

Question page
How do I go about training a dance classifier with video files? Are there any components for audio, video to get started with?

You can definitely build a dance classifier using the action classifier in Create ML. See the following session from WWDC 2020:

Build an Action Classifier with Create ML

Question page
I am looking to detect or classify a jersey number from a sporting event such as hockey in a video. I have tried VNRecognizeTextRequest but do not get good results; is there a better way to do such a task? Would I be better off creating my own model for this?

The results should have improved a bit using Revision3 of the VNRecognizeTextRequest. You could try that first.

Or you could train a custom classifier but that requires loads of images to get good results from that.

When text gets deformed on fabric or obscured it gets very difficult to read.

Question page
What's the best way to create a "silhouette" video as opposed to a silhouette photo? Would Optical flow be best for this or sampling every frame for a silhouette or...

It depends! The two key considerations are:

  • how expensive is it to generate a silhouette a priori for every frame? If that’s cheap enough, it might be simpler and better to do that;
  • on the other hand, optical flow can help in frame-to-frame stability.

It’s really going to depend on surrounding context and performance requirements (both latency and accuracy).

Question page
What is the difference between optical flow and a VNTrajectoryRequest? Would tracking a trajectory of an object benefit from a work flow that used both?

The trajectory request is specifically developed to track objects on a trajectory, meaning not any kind of zig-zag path. Optical flow will detect any motion without the constraint of a trajectory.

So if you want to track a ball being thrown, you will get better results with the trajectory request.

If you want to see if something moved in, for instance, security camera footage, then you use optical flow.

Question page
The MetalFX team has presented a very nice (classical) method for video upscaling. What is the potential of using MPS to achieve machine learning upscaling?

MPSGraph supports most common neural-network machine learning layers and operations so you should be able to create an upscaling network from the basic components, but MPSGraph doesn't have prebuilt graphs or networks so you would need to investigate and research the network architecture yourself, train it (using MPSGraph or other training frameworks) and deploy on MPSGraph.

One benefit of using MPSGraph is that you can pretty easily incorporate other Metal kernels (for example MPS image processing kernels or your own kernels) and encode them to the same Metal CommandQueue (or MPSCommandBuffer) to achieve low-latency, often zero-copy execution between the pre/post-processing kernels and the MPSGraph segment(s).

Question page
Vision
This may be a more general Machine Learning question. I am interested in the ability to extract text from images and video. The Vision framework does a great job of extracting the text. One thing I would like to do is determine whether each piece of text is in a particular category of typeface, largely looking to tell a source code / monospace font from a sans serif font. Which machine learning technologies available on Apple platforms would be best suited for that? And a high level of how you might approach that?

So the Vision framework, where you extract the text, tells you the region in the image where the text is; the first thing to do would be to crop the text out of the image.

If you have a binary image classifier (sans serif vs serif, or “looks like source code” vs “doesn’t look like source code”, it’s worth experimenting with what definition works best – and you’d need to collect samples of each for your training set!), you can then throw that crop to this classifier to work out whether it’s source code or not.

So at a high level, what I’d do is:

  • train a binary classifier to distinguish source-code from not-source-code
  • using Vision, crop out the region of the image with detected text in
  • use your classifier to determine whether it’s source code or not

and go from there!

Also you can try out the Live Text API this year; it's able to extract text out of images and videos. However it does not provide font-related information of the text yet. You can file a bug with us to track this issue if needed.


From a non-Apple developer:

I created a serif/sans-serif model with CreateML. You can find it here: https://github.com/jmousseau/Mimeo

Question page
There was a great WWDC video back in 2018 titled Vision with Core ML. The example app used live video capture to feed an image through the model using scene stability. There is sample code out there for UIKit, but always wanted to try and re-make using SwiftUI as a starting point. Any tips or pointers on making an image recognition app with live video capture using SwiftUI as a starting point ?

You have a couple choices here. You can use the techniques you already know, and use UIViewRepresentable to import the UIViews into your app. That will still work really well!

I've actually done exactly that using UIViewControllerRepresentable in the app I'm working on, and used a modified version of the VisionViewController which I think came from that video's sample code. It works perfectly... the entire app is SwiftUI except for the VisionViewController part.

Here's a link to the original sample code that has the VisionViewController (click the download button from that page): Training a Create ML Model to Classify Flowers. Unfortunately I don't have a publicly available version that shows where I've used that with UIViewControllerRepresentable.

Alternatively, you can display the camera feed by sending the image from each frame into a SwiftUI Image. The new Create ML Components feature actually includes a VideoReader.readCamera method which provides an asynchronous stream of image buffers, which is a great way to get started with this approach. Alternatively you can use your existing AVCaptureDevice logic and delegate methods to provide a series of images. You can see an example of this approach in the rep counting demo app which will be available soon as part of this year's WWDC session What's new in Create ML

Recently I refactored some UIKit code into SwiftUI and found that the ViewController could be relatively easily transformed into an ObservableObject class, changing @IBOutlets into @Published properties.

I've been experimenting with using AVCaptureVideoPreviewLayer too. Using a preview layer is likely to be a very CPU / memory efficient way to do it. I don't yet have a clear picture of the efficiency of using SwiftUI views to run the preview directly in the way I proposed, so while it can be a lot of fun to do it purely in SwiftUI it may not be the best choice just yet.

Question page
I'm new to ML. I would like to implement some sort of color matching with two photos (i.e. when superimposing a person on a different background, adjusting color, contrast, etc. to match the background better). Is something like that suited for Core ML (and if so, do you have any suggestions on how to approach that?), or would a simple algorithm be a better solution for those kinds of tasks?

Doesn't sound like something you can do with Core ML or Create ML. Try the Vision API Q/A on Thursday from 2-3 PM PT.

Question page
What is a good approach in case of image classification problem. I am trying to classify two similar shapes - let's say a circle and an oval, in some case the confidence for the oval is very high for the circle input.

Have you looked at VNDetectContoursRequest? Using this traditional computer vision approach might give you better results.

Question page
Vision question: does a VNRectangleObservation contain information about the shape's full outline? For example: a document scanner that needs to fully extract a page from its background. VNDetectRectanglesRequest will provide the position of each corner of the page, which allows us to clip the shape assuming the page is flat. But if the paper is curled, we can end up with bits of background in our cropped image. Is there a way to trace an accurate outline for imperfect rectangles?

For imperfect documents, you might want to look at VNDetectDocumentSegmentationRequest and they combine it with a contour detection on the globalSegmentationMask that you get as a result.

https://developer.apple.com/documentation/vision/vndetectedobjectobservation/3798796-globalsegmentationmask

It is a low-res pixel buffer that represents the shape of the detected document, where each pixel represents the confidence of being or not being part of the document.

Document segmentation is ML based and trained on all kinds of documents, labels, papers, etc. Rectangle detection is a traditional algorithm that works on edges that intersect to form a quad.

Question page
May I ask if there is some functionality to enable to recognize the direction (arrow) of time in a video?

Interesting problem! You may find understanding the apparent motion of pixels in the image a useful input. You can compute optical flow using Vision VNGenerateOpticalFlowRequest

Question page
Can optical flow be used in situations where more than one object is moving at the same time?

Yes, optical flow output is per pixel. Motion information will be returned for all parts of the image, and therefore for all moving objects in the scene.

Question page
With VNDocumentCameraViewController, is it possible to limit it to just one scan, so that the user doesn't have to press "Save" at the end?

No, sorry. Would appreciate a Feedback for an enhancement request, though

Question page
I am looking to detect or classify a jersey number from a sporting event such as hockey in a video. I have tried VNRecognizeTextRequest but do not get good results; is there a better way to do such a task? Would I be better off creating my own model for this?

The results should have improved a bit using Revision3 of the VNRecognizeTextRequest. You could try that first.

Or you could train a custom classifier but that requires loads of images to get good results from that.

When text gets deformed on fabric or obscured it gets very difficult to read.

Question page
Clarifying question on what's new in Vision. I think v3 brings improved face recognition & barcode recognition & previews for those. Optical flow is entirely new, and the UI for text recognition through video is entirely new. Do I have this right? Anything else new in Vision?

Optical flow is not entirely new, there was already a prior revision 1 for optical flow.

You are correct about barcode, but face recognition is not offered by Vision.

You are also correct that the UI for text recognition is new.

Other things new in Vision are a new text recognition revision, and the new functionality in Xcode for Quick Look Preview support. We also deprecated older face detection and face landmarks revisions.

Question page
Up until iOS 15, the rectangle tracking VNTrackRectangleRequest returned precise corners of tracked rectangle. Since iOS 15, it seems to only return the bounding box. This is present even in the original detection/tracking demos. What is the suggested way to get tracked rectangles (and also support the original Vision framework to support iOS 11+)?

Have you tried the iOS16 beta?

Question page
What's the best way to create a "silhouette" video as opposed to a silhouette photo? Would Optical flow be best for this or sampling every frame for a silhouette or...

It depends! The two key considerations are:

  • how expensive is it to generate a silhouette a priori for every frame? If that’s cheap enough, it might be simpler and better to do that;
  • on the other hand, optical flow can help in frame-to-frame stability.

It’s really going to depend on surrounding context and performance requirements (both latency and accuracy).

Question page
What is your recommendation for using DataScannerViewController to detect money/currency values? DataScannerViewController.TextContentType does not appear to support money, currencies, or generic numbers (see FB10139138). The iOS Camera app supports money/currency detection in iOS 16. What is the best practice for me to implement a similar feature in my app? Should I recognize all text and then parse each recognized text item myself to determine if the string value contains number or currency amount?

Ah, good enhancement request. Currently we're not supporting currency.

You might be able to detect the presence of currency with UIDataDetectors, but you won't be able to highlight them.

Another option is to use capturePhoto() to take a still then use the Live Text APIs. That'll highlight all the data detector elements, not just money.

You could also add some business logic on top of Vision text recognition to dial in on numbers only, specific (relative) text size or even position of the text in the rectangle of the currency.

Question page
What is the difference between optical flow and a VNTrajectoryRequest? Would tracking a trajectory of an object benefit from a work flow that used both?

The trajectory request is specifically developed to track objects on a trajectory, meaning not any kind of zig-zag path. Optical flow will detect any motion without the constraint of a trajectory.

So if you want to track a ball being thrown, you will get better results with the trajectory request.

If you want to see if something moved in, for instance, security camera footage, then you use optical flow.

Question page
V3 extends VNRecognizeTextRequest with automaticallyDetectsLanguage - If I turn this on, how do I discover what language it decided to use?

Vision will not tell you which languages have been detected. The intent of this is to allow the client to give a "hint" to the algorithm.

If you already know the language up-front, it's best to specify that language explicitly, which allows the framework to target that language for better accuracy. If you do not, it's better to set automaticallyDetectsLanguage to true, which essentially is communicating to the framework "I don't know which language" and the framework will do its best to decode any language.

You can use NLLanguageRecognizer to detect the dominant language after the text has been extracted by Vision.

Usually a sentence is sufficient to identify the language. You can pass in as much as you like, but the algorithm limits the amount of text it will consider. Less than maybe 5-10 words is challenging.

If you have some prior information as to what the language might be, you can also pass hints and/or constraints to NLLanguageRecognizer.
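
Putting those pieces together, a hedged sketch might look like this; it assumes iOS 16 for automaticallyDetectsLanguage, and joining the transcripts is just one way to feed the recognizer:

import Vision
import NaturalLanguage

// Let Vision decode without a language hint, then ask NLLanguageRecognizer
// which language the extracted text is in.
func recognizeTextAndLanguage(in image: CGImage) throws -> (text: String, language: NLLanguage?) {
    let request = VNRecognizeTextRequest()
    request.automaticallyDetectsLanguage = true   // "I don't know which language"

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let text = request.results?
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n") ?? ""

    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return (text, recognizer.dominantLanguage)
}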

Question page
How different are VNRecognizedTextObservations (returned by VNRecognizeTextRequest) to the RecognizedItem array returned by DataScannerViewController? Do they have the same information in them? Also, is the DataScannerViewController using the same VNRecognizeTextRequest (with revision3) in the background to process the results?

RecognizedItem contains a lot of the same information, like the transcript and corners (RecognizedItem's corners are in view coordinates, however), and RecognizedItem exposes the related Vision observation.

RecognizedItem, however, learns over time, so the longer we see a text group, the more accurate the transcript will be. The Vision observation that is exposed is really just based on the last frame processed.

I cannot state which revision it uses, if any (sorry to be vague), but DataScanner supports the same languages as VNRecognizeTextRequest.
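
To make the comparison concrete, here is a small sketch of consuming RecognizedItem from the scanner's delegate; the printout is only illustrative:

import Vision
import VisionKit

@available(iOS 16.0, *)
@MainActor
final class ScannerDelegate: DataScannerViewControllerDelegate {
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didAdd addedItems: [RecognizedItem],
                     allItems: [RecognizedItem]) {
        for item in addedItems {
            switch item {
            case .text(let text):
                // The transcript improves the longer the item stays in view;
                // the exposed observation reflects only the last processed frame.
                print(text.transcript, text.observation)
            case .barcode(let barcode):
                print(barcode.payloadStringValue ?? "")
            @unknown default:
                break
            }
        }
    }
}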

Question page
Can VNRecognizeTextRequest be used to perform text recognition on images with handwritten content or should it only be used for typed text (or very close to typed)?

Yes, it can to some degree. It won't read my bad handwriting for sure, but others will work. Handwriting is of course so varied that it really depends on the person who wrote it.

Question page
Is live text powered by Vision APIs?

Live text is using Vision for its recognition work.

Question page
I saw on another thread that there is a new revision of VNRecognizeTextRequest (v3), but I can’t find it in the documentation: how can I enable it and can I “force” Vision to use a “minimum revision” (for example, 3 and later)?

From a non-Apple developer:

I handle code from Beta SDKs like this:

#if defined(MAC_OS_VERSION_13_0) && MAC_OS_X_VERSION_MAX_ALLOWED >= MAC_OS_VERSION_13_0
#warning "if you can see this, it's time to remove the #if."
    if (@available(macOS 13.0, *))
    {
        revision = VNRecognizeTextRequestRevision3;
    }
#endif
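
In Swift, the equivalent availability gate might look like this, with a fallback to the revision your deployment target already supports:

import Vision

let request = VNRecognizeTextRequest()
// Prefer the newest revision when the OS supports it; fall back otherwise.
if #available(iOS 16.0, macOS 13.0, *) {
    request.revision = VNRecognizeTextRequestRevision3
} else {
    request.revision = VNRecognizeTextRequestRevision2
}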
Question page
Last year you introduced the VNGeneratePersonSegmentationRequest. I know you can't comment on future plans, but it would be amazing if the new pet / object segmentation of iOS 16 was available to developers

It is always good to file feedback and explain what you are looking for.

Question page
I'm part of a team that is building an app that is wishing to identify and recognise faces in a collection of photos. At the moment I've had success with Photos/Vision framework to find faces in photos and isolate them, but we're currently then sending those faces to AWS Amazon Rekognition service to help compare the face to a set of others and associate them to an existing face, or create a new face model. If I wanted to move this type of modeling onto the device itself (rather going through a network request to a 3rd party service), could you possibly guide me where to start? I'm assuming I could do the same thing locally on device using Apple frameworks?

We do not offer on-device face recognition solutions.

Generally speaking, you would need to either find (or train, if you have the data and the know-how) a face recognition model, which could then be run on-device through Core ML once converted into that format. Often such models return some descriptor, which can be compared to other similar descriptors to provide a distance. How best to measure that distance is often tied in to how the face recognition model was trained.
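
As a purely illustrative sketch of the comparison step (the metric, the threshold, and the placeholder embeddings below are assumptions, not part of any Apple API):

// Compare two face embeddings, e.g. MLMultiArray outputs of a third-party
// face recognition model flattened to [Float], using cosine distance.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same length")
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return 1 - dot / (normA * normB)
}

// Placeholder embeddings; a smaller distance suggests the same person,
// and the threshold must be tuned for the specific model you adopt.
let embeddingA: [Float] = [0.12, 0.80, 0.33, 0.47]
let embeddingB: [Float] = [0.10, 0.78, 0.35, 0.50]
let likelySamePerson = cosineDistance(embeddingA, embeddingB) < 0.4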

You may file a Feedback Assistant request if you'd like Apple to offer face recognition in the future.

Question page
We currently use a CoreML model with a C+ [sic] framework to handle initialization parameters in our processing queue (how long to hold an object, time an object should be in frame etc) and then run the ML model on the image captured with those parameters. Is Vision a better alternative than running our own initializers like that? Can we specify with Vision the retention time of images for processing images asynchronously? What is best practice there? Thank you!

Not sure about C+ [sic] in terms of its retention, but as long as you hold a VNImageRequestHandler, the image will be held.

Question page
My app iterates over the user's entire photo library using VNDetectHumanRectanglesRequest and VNRecognizeAnimalsRequest, in order to find all the photos containing humans and pets. For performance reasons, I'm only loading a small version of the photo. I've noticed that this (obviously) affects the results. Is there a recommended image size when using these requests? I'd also appreciate any other ideas on how to optimize the performance for such a task.

There is no hard and fast size that works for everything, because detection is limited by the size of the dog or human relative to the overall image. So it depends on your use case, for instance whether you want to find a small dog in the background of a large panorama.
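
For reference, both requests can share one handler, so a sketch of the per-photo step (the thumbnail is whatever your own photo-loading code produces) could be:

import Vision

// One handler, two requests. Whether a reduced-size image is good enough
// depends on how large people and pets are relative to the whole frame.
func findHumansAndPets(in thumbnail: CGImage) throws -> (humans: Int, pets: Int) {
    let humanRequest = VNDetectHumanRectanglesRequest()
    let animalRequest = VNRecognizeAnimalsRequest()

    let handler = VNImageRequestHandler(cgImage: thumbnail, options: [:])
    try handler.perform([humanRequest, animalRequest])

    return (humanRequest.results?.count ?? 0,
            animalRequest.results?.count ?? 0)
}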

Question page
Is it possible to have player (end-user) enabled Machine Learning? For example in my game Follow the White Rabbit it would be helpful to adjust the model. For example supporting different hand sizes, skin tones, as well as support hands that had more/less than the standard number of fingers.

Yes, you can adapt a model on-device using any one of our ML frameworks, including Core ML, Create ML Components, MPSGraph, and BNNS. The approach you take depends on the data and problem you are working with.

To detect hand poses, I recommend checking out the sample code project Detecting Hand Poses with Vision.

If you foresee training on a small dataset, then it might be worth looking into using the KNN algorithm available in Core ML, check out the sample code project Personalizing a Model with On-Device Updates to learn more.

Finally, it is worth browsing through the documentation for the newly released API, Create ML Components.
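
If hand input specifically is what you are adapting to, a hedged sketch of the detection side could look like this; the fingertip selection and confidence cutoff are arbitrary choices for illustration:

import Vision

// Detect up to two hands and pull out fingertip locations for your own
// game logic or on-device training data collection.
func detectFingertips(in image: CGImage) throws -> [[CGPoint]] {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 2

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let tips: [VNHumanHandPoseObservation.JointName] = [
        .thumbTip, .indexTip, .middleTip, .ringTip, .littleTip
    ]
    return try (request.results ?? []).map { observation in
        try tips.compactMap { joint -> CGPoint? in
            let point = try observation.recognizedPoint(joint)
            // Skip low-confidence joints; locations are normalized (0...1).
            return point.confidence > 0.3 ? point.location : nil
        }
    }
}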

Question page
Great session on image colorization, Geppy. Do you have any examples of user-customizable hand tracking? Think magic spells.

Interestingly, Geppy is a super hero as well. You can watch him demonstrate his powers with hand pose and action classification at the end of this session from last year:

Classify hand poses and actions with Create ML

Question page
VisionKit
This may be a more general Machine Learning question. I am interested in the ability to extract text from images and video. The Vision framework does a great job of extracting the text. One thing I would like to do is determine whether each piece of text is in a particular category of typeface, largely looking to tell a source code / monospace font from a sans serif font. Which machine learning technologies available on Apple platforms would be best suited for that? And a high level of how you might approach that?

So the Vision framework, where you extract the text, tells you the region in the image where the text is; the first thing to do would be to crop the text out of the image.

If you have a binary image classifier (sans serif vs serif, or “looks like source code” vs “doesn’t look like source code”, it’s worth experimenting with what definition works best – and you’d need to collect samples of each for your training set!), you can then throw that crop to this classifier to work out whether it’s source code or not.

So at a high level, what I’d do is:

  • train a binary classifier to distinguish source-code from not-source-code
  • using Vision, crop out the region of the image with detected text in
  • use your classifier to determine whether it’s source code or not

and go from there!
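
As a hedged sketch of that flow (the classifier is whatever binary model you train yourself, and the "source-code" label is a placeholder):

import Vision

// Crop each detected text region and run it through your own binary classifier.
func classifyTextRegions(in image: CGImage, classifier: VNCoreMLModel) throws -> [(text: String, isSourceCode: Bool)] {
    let textRequest = VNRecognizeTextRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([textRequest])

    var results: [(text: String, isSourceCode: Bool)] = []
    for observation in textRequest.results ?? [] {
        guard let candidate = observation.topCandidates(1).first else { continue }

        // boundingBox is normalized with a lower-left origin, so flip and
        // scale it into pixel coordinates before cropping.
        let box = observation.boundingBox
        let rect = CGRect(x: box.minX * CGFloat(image.width),
                          y: (1 - box.maxY) * CGFloat(image.height),
                          width: box.width * CGFloat(image.width),
                          height: box.height * CGFloat(image.height))
        guard let crop = image.cropping(to: rect) else { continue }

        // "source-code" is a placeholder class label from the hypothetical model.
        let classifyRequest = VNCoreMLRequest(model: classifier)
        try VNImageRequestHandler(cgImage: crop, options: [:]).perform([classifyRequest])
        let top = (classifyRequest.results as? [VNClassificationObservation])?.first
        results.append((candidate.string, top?.identifier == "source-code"))
    }
    return results
}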

Also, you can try out the Live Text API this year; it's able to extract text out of images and videos. However, it does not provide font-related information about the text yet. You can file a bug with us to track this issue if needed.


From a non-Apple developer:

I created a serif/sans-serif model with CreateML. You can find it here: https://github.com/jmousseau/Mimeo

Question page
Can the live text selected automatically (in a designated area) simply select all without user highlight?

Sorry, select all isn't an option currently. The user must manually select the text first. The only thing you can do is reset the selection if one exists.

Question page
We can use Shipment Tracking Number, URL as a source for live text. Can we define our own source for the live text? Let’s say I wanna add detection of new couriers other than FedEx or UPS.

For the DataScannerViewController? Sorry, right now there's just the one option for shipment tracking numbers, and it's whatever carriers we're able to detect.

Question page
What is your recommendation for using DataScannerViewController to detect money/currency values? DataScannerViewController.TextContentType does not appear to support money, currencies, or generic numbers (see FB10139138). The iOS Camera app supports money/currency detection in iOS 16. What is the best practice for me to implement a similar feature in my app? Should I recognize all text and then parse each recognized text item myself to determine if the string value contains number or currency amount?

Ah, good enhancement request. Currently we're not supporting currency.

You might be able to detect the presence of currency with UIDataDetectors, but you won't be able to highlight them.

Another option is to use capturePhoto() to take a still then use the Live Text APIs. That'll highlight all the data detector elements, not just money.

You could also add some business logic on top of Vision text recognition to dial in on numbers only, specific (relative) text size or even position of the text in the rectangle of the currency.

Question page
How different are VNRecognizedTextObservations (returned by VNRecognizeTextRequest) to the RecognizedItem array returned by DataScannerViewController? Do they have the same information in them? Also, is the DataScannerViewController using the same VNRecognizeTextRequest (with revision3) in the background to process the results?

RecognizedItem contains a lot of the same information, like the transcript and corners (RecognizedItem's corners are in view coordinates, however), and RecognizedItem exposes the related Vision observation.

RecognizedItem, however, learns over time, so the longer we see a text group, the more accurate the transcript will be. The Vision observation that is exposed is really just based on the last frame processed.

I cannot state which revision it uses, if any (sorry to be vague), but DataScanner supports the same languages as VNRecognizeTextRequest.

Question page
Is live text powered by Vision APIs?

Live text is using Vision for its recognition work.

Question page
Live Text seems to be added to UIImageView via `.addInteraction(<ImageAnalysisInteraction>)`. Is there a way to add this interaction to SwiftUI's `Image`?

I don’t believe so. You can probably wrap a UIImageView in a ViewRepresentable - but of course it wouldn’t be a SwiftUI image anymore. I think we might have sample code that does something similar in the State of the Union donut app.

Here's the relevant part in the Platforms State of the Union video.

But yeah, we definitely need SwiftUI support for the new VisionKit APIs 🙂 feedbacks may help
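
In the meantime, a rough sketch of the wrapping approach (LiveTextImageView is a placeholder name; it assumes iOS 16 and omits error handling):

import SwiftUI
import UIKit
import VisionKit

@available(iOS 16.0, *)
@MainActor
struct LiveTextImageView: UIViewRepresentable {
    let image: UIImage

    func makeUIView(context: Context) -> UIImageView {
        let imageView = UIImageView(image: image)
        imageView.contentMode = .scaleAspectFit
        imageView.isUserInteractionEnabled = true
        imageView.addInteraction(ImageAnalysisInteraction())
        return imageView
    }

    func updateUIView(_ imageView: UIImageView, context: Context) {
        guard let interaction = imageView.interactions
            .compactMap({ $0 as? ImageAnalysisInteraction }).first else { return }

        // Analyze asynchronously, then hand the result to the interaction.
        Task {
            let analyzer = ImageAnalyzer()
            let configuration = ImageAnalyzer.Configuration([.text])
            if let analysis = try? await analyzer.analyze(image, configuration: configuration) {
                interaction.analysis = analysis
                interaction.preferredInteractionTypes = .textSelection
            }
        }
    }
}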

Question page
Xcode
Do you know if it is possible to have a layer-wise execution time profiling with XCode 14 for the operations that run on the Neural Engine or GPU?

The new performance reports give a layer-wise breakdown of compute unit support, but not execution time. Please consider filing some feedback at:

http://feedbackassistant.apple.com

Question page
Core ML Performance Report is great, but I can't find per-layer performance stats to find bottlenecks in our model.

The new performance reports offer per-layer compute unit support, but not per-layer timings. One further step you can take to find bottlenecks in the model is to press the "Open in Instruments" button where you can see further details in the Core ML Instrument. This won't offer per layer timing details, but it can help find bottlenecks related to data operations and compute unit changes.

Question page
Is there a way to run performance report on older versions of iOS? I suspect new compiler runs model differently than the Xcode 13 one.

Performance reports require iOS 16. Older iOS versions unfortunately cannot provide the same information.

Question page
There are a lot of CoreMLCompiler versions throughout the Xcode history. Some break inference (e.g. some of the Xcode 13 coremlcompilers broke iOS 14 runtime). Is there a way to diagnose these errors without compiling under every compiler and running on all iOS devices?) And is there a known stable version of a compiler?

Maintaining backward compatibility is really important for us. We would like to understand this better. Could you file a bug report on:

feedbackassistant.apple.com?

If it works for you, could you please set up a 1:1 lab with Apple engineers so we can understand the issue better?

Question page
Tagged with: