Training data

As we mentioned in module 1, Large Language Models are trained on existing examples of digital files that software companies have selected as representative of data humans might want to read, see, produce, or edit. As of 2023, we don’t know exactly what is in this training data, but this situation might change.

Many people have raised concerns about the types and range of material that may have been included in this training data. These concerns include:

  1. Is the training data biased towards a particular interest or viewpoint?
  2. Are all languages and cultures equally represented?
  3. Are copyrighted materials included in the training data, and if so, does the software company have permission to alter them?