It’s not data compression. Most machine learning, AI, and modeling programs do not use their training data once they’re released for general use, and none of those data are compressed in the way that 7-Zip, MP3, etc compress them. There is no compressed dataset that can be re-expanded into its original form. AI and modeling programs store summarized and stratified data (which are just complex number sets generated from real world input like sound waves, rates of occurrence, indexed information, etc) in hierarchical lists. But these are not compressed files of the original data. If you’re interested, here’s what I hope is a simple explanation that will make sense. Ignore the rest if you have no interest or bore easily.

_____________________________________
Machine learning and AI use data schemes in which all of the data on which they base their “intelligence” are sorted, stratified, and prioritized by the contribution of each bit of data to the accuracy of the program’s output (the “weight” of each individual parameter in the data set). An initial set of algorithms is defined based on deep analysis of those data using forms of regression, recursive analysis etc to organize, arborize, and make sense of the data. During training, the computer constantly compares its output to the training data and adjusts its algorithms on the fly. Reassessing its accuracy after each run helps it to prioritize (ie “weight”) the strongest contributing data most highly and reduce the weight of those that correlate less strongly with accuracy.
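For anyone who wants to see the "compare, then adjust the weights" loop in code, here is a toy sketch in Python. It is not how any particular product works. The data, the learning rate, and the function name train are all made up for illustration. The targets are simply 2 times the first input, so the second input is irrelevant and its weight should drift toward zero.

```python
# Toy training loop. All names and numbers are invented for illustration.
# Each sample is (features, target); targets equal 2 * first feature,
# so the second feature contributes nothing to accuracy.

def train(samples, steps=2000, lr=0.01):
    weights = [0.0, 0.0]
    for _ in range(steps):
        for features, target in samples:
            # Compare the model's output to the training data...
            prediction = sum(w * x for w, x in zip(weights, features))
            error = prediction - target
            # ...and nudge each weight in the direction that shrinks the error.
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights

samples = [
    ((1.0, 3.0), 2.0),
    ((2.0, 1.0), 4.0),
    ((3.0, 2.0), 6.0),
    ((4.0, 5.0), 8.0),
]
weights = train(samples)
print(weights)  # first weight ends up close to 2, second close to 0
```

Note that what comes out of train is just two numbers. The four samples cannot be rebuilt from them, which is the point about this not being compression.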
Once most machine learning models are trained, they are put into production and their training data are no longer accessed. So technically, they’re no longer learning once they become the kind of AI we use in smart phones, digital assistants, etc. They store algorithms and the volume of highest weighted parameters needed to achieve their target accuracy rates (as determined by their designers and engineers). Some models (eg KNN) do retain their training data and search it directly to answer queries. Most such programs do not.
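Here is a small Python sketch of that difference, under made-up assumptions: one invented feature (call it a "tempo score") and two invented labels. The KNN function has to be handed the stored examples every time it answers. The threshold model is fitted once and afterwards needs only a single number.

```python
# Hypothetical data: (tempo score, label). Invented for illustration only.
training_data = [(1.0, "ballad"), (2.0, "ballad"), (8.0, "dance"), (9.0, "dance")]

def knn_predict(x, data, k=3):
    # KNN keeps the training data and searches it directly on every query.
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def fit_threshold(data):
    # A parametric model boils the data down to one learned number:
    # the midpoint between the average score of each label.
    ballads = [x for x, label in data if label == "ballad"]
    dances = [x for x, label in data if label == "dance"]
    return (sum(ballads) / len(ballads) + sum(dances) / len(dances)) / 2

def threshold_predict(x, threshold):
    # No training data needed here, only the stored threshold.
    return "dance" if x > threshold else "ballad"

threshold = fit_threshold(training_data)
print(knn_predict(7.5, training_data))    # needs the stored examples
print(threshold_predict(7.5, threshold))  # needs only the number
```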
Picture an AI program that creates a fakebook on the fly, to be used by X% (target TBD by the designers, based on their business model and practical reality) of gigging guitarists: jazz groups, wedding bands, country players, Latin bands, studio pros, etc. Subscribers could have it create custom books to be displayed on tablets for any given gig in any location, eg a trio playing a second wedding at a Napa winery, a 9 piece commercial band playing a white shoe law firm’s Christmas party in Manhattan, or a country band playing a retirement party for the owner of the oldest grocery store in a small town in Oklahoma, etc. Accuracy would be measured by the percentage of tunes used by subscribers on actual gigs, along with the number of tunes omitted but requested by subscribers. Since the output would be a digital fakebook accessed over the internet, actual usage and all kinds of feedback would be incorporated into the evolution of the model.
Training data might include set lists for wedding bands, night club shows, etc plus DJ request lists, lists of gigs by band / type / region, online reviews of bands across the country (with detailed likes & dislikes) etc. The computer might first sort the data into what songs were most played, then most popular, best liked, most disliked, most requested by brides-to-be, etc and then place them in different gig settings in multiple geographic areas.
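That first sorting pass might look something like this in Python. It is only a sketch: the set lists are invented, and plays_by_gig_type is a hypothetical helper, not part of any real product.

```python
from collections import Counter, defaultdict

# Invented set lists: (gig type, songs played). Illustration only.
set_lists = [
    ("wedding", ["At Last", "September", "Brown Eyed Girl"]),
    ("wedding", ["At Last", "September"]),
    ("club", ["September", "Autumn Leaves"]),
]

def plays_by_gig_type(set_lists):
    # Tally how often each song was played, separately for each gig setting.
    counts = defaultdict(Counter)
    for gig_type, songs in set_lists:
        counts[gig_type].update(songs)
    return counts

counts = plays_by_gig_type(set_lists)
for gig_type, tally in counts.items():
    print(gig_type, tally.most_common(2))
```

Request lists, reviews, and regions would get the same treatment, each producing its own table of numbers.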
It would then identify discrepancies such as a song that’s played often but not requested often or a song that’s almost always played at small high end weddings but almost never at fire hall wedding receptions. The creators, designers and software engineers would have to decide what’s important and in what order these all appear in the algorithms. Will they consider ethnicity, total cost of the affair, etc? How finely will they stratify for geography - by region? state? city? neighborhood?
The data tables are then arranged by the program to reflect the initial priorities (“weights”) of all of these bits of information. A decision has to be made - how big will the book be: 100 tunes, 500 tunes, or a floating parameter based on variables like top X% of tunes for a given gig, location, etc? The designers, engineers etc then have to decide how deeply the program will go into each category, eg do they use the top 10% or 50% or all the data? Then the model trains by spitting out its first set of fakebooks and comparing them to all the categories for accuracy and consistency.
If that first book contains a song that’s played by fewer than 15% of bands but often requested, and the bands that play it have 50+% more gigs in the same area and kind of gig than those that don’t, the machine may increase the weight of band popularity and decrease the weight of how often that song is played. This kind of analysis is applied to each and every parameter in the database until the output is consistent with the designers’ goals for it. That consistency includes the rate of accuracy they want.
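To make the reweighting concrete, here is a hedged Python sketch with two invented songs and three invented parameters. The numbers are made up, and a real model would adjust the weights automatically rather than by hand, but it shows how shifting weight from play rate to band popularity changes which song makes the book.

```python
# Invented parameters on a 0-to-1 scale. Illustration only.
songs = {
    "Song A": {"play_rate": 0.80, "request_rate": 0.20, "band_popularity": 0.30},
    "Song B": {"play_rate": 0.15, "request_rate": 0.70, "band_popularity": 0.90},
}

def score(song, weights):
    # Weighted sum: each parameter counts in proportion to its weight.
    return sum(weights[name] * song[name] for name in weights)

def ranking(songs, weights):
    # Highest scoring song first.
    return sorted(songs, key=lambda title: score(songs[title], weights), reverse=True)

# Initial priorities favor how often a song is played.
before = {"play_rate": 0.6, "request_rate": 0.2, "band_popularity": 0.2}
# After retraining, band popularity counts for more and play rate for less.
after = {"play_rate": 0.2, "request_rate": 0.2, "band_popularity": 0.6}

print(ranking(songs, before))  # Song A ranks first
print(ranking(songs, after))   # Song B ranks first
```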
There will still be errors, which in this example means tunes that bomb for some bands, omitted tunes that should have been included, etc. So the model will need real time feedback and periodic retraining both to improve accuracy and to incorporate changes like new hits, fads, resurgence of interest etc. Users could request a missing song from the stand. Models like KNN that retain the full training set can retrain at will, but the data may have to be augmented and readjusted offline for those changes to maximize utility.
The same kind of process is followed to create models of sound etc. The data set includes far more parameters than I can detail here, but they’re all collected and archived from digital representations of the entity to be modeled. An impulse response (IR) yields data. The audio output signal from a preamp yields data. Then there’s the issue of variable parameters like EQ, which is often digitized by capturing the output of the preamp at all combinations and permutations of the EQ settings. There are even robotic devices that turn the EQ knobs on the actual source device in tiny increments to permit capture of the full EQ spectrum with hundreds or even thousands of takes of the same material.
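A quick Python sketch shows why the number of takes climbs so fast. The three knob names and the 0 to 10 positions are assumptions for illustration, and eq_settings is a hypothetical helper, not any capture rig’s real software.

```python
from itertools import product

def eq_settings(knobs, positions):
    # Every combination of knob positions, one dict per capture to record.
    return [dict(zip(knobs, combo)) for combo in product(positions, repeat=len(knobs))]

knobs = ["bass", "mid", "treble"]  # assumed three-band EQ
positions = range(0, 11)           # assumed whole-number steps from 0 to 10

settings = eq_settings(knobs, positions)
print(len(settings))  # 11 * 11 * 11 = 1331 captures
print(settings[0])    # {'bass': 0, 'mid': 0, 'treble': 0}
```

Halving the step size or adding a fourth knob multiplies the count again, which is where the robot comes in.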
Once the first training set is collected, those data are categorized, collated, and prioritized by letting the program develop algorithms from which to build the modeled output. Outputs are compared to the original captured sound and everything is tweaked to improve accuracy. In this case, the goal is fidelity to the original sound. If a complete amp model is being created, the controls have to work just as they do on the real thing. Etc etc etc.
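Here is a bare-bones Python sketch of that compare-and-tweak step. Everything in it is a stand-in: the "amp" is a simple tanh soft clipper with one gain parameter, the "captured sound" is generated by that same function rather than recorded, and the search just tries a list of candidate gains. Real amp modeling uses far richer models, but the fidelity measure works the same way.

```python
import math

def amp_model(signal, gain):
    # Stand-in amp: soft clipping controlled by a single gain parameter.
    return [math.tanh(gain * s) for s in signal]

def rms_error(a, b):
    # Root-mean-square difference between two signals; 0 means identical.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Test tone fed into both the "real" amp and the model.
test_signal = [math.sin(2 * math.pi * n / 50) for n in range(200)]

# Stand-in for a real recording: pretend the hardware behaves like gain 3.0.
captured = amp_model(test_signal, gain=3.0)

# Try gains from 1.0 to 6.0 and keep the one whose output best matches the capture.
candidates = [g / 10 for g in range(10, 61)]
best_gain = min(candidates, key=lambda g: rms_error(amp_model(test_signal, g), captured))
print(best_gain)
```

The finished model ships with best_gain, not with the captured audio.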