alt text

Besides the text contents processing, retrieving information and annotating inner objects in texts, in case of Machine Learning models it is also required to manage data subsets. In other words, there is a need to provide rules on how the series of documents is expected to be grouped in subsets. For example, we may declare subsets for: Training, Testing, Validating, etc. This post represents an AREkit tutorial devoted to Folding as a common type and the related concept on how data separation might be described and passed into other pipelines required so.

Tutorial code

In order to describe document separation format, AREkit-0.22.1 provides the BaseDataFolding type, which in short could be described in the following snippet:

class BaseDataFolding(object):
    # ... other contents
    def fold_doc_ids_set(self):
        raise NotImplementedError()

According to the implementation above, declaring your own folding required fold_doc_ids_set implemenation. This implementation considers that for a given set of data types, such as Train, Test, Dev, and many others, we declare a set of the related documents.

Folding Types

One of the common type of folidings is the predefined one, or fixed. In this type we consider a predefined separation of the document indices between predefined data types. This folding type might be initialized as follows:

parts = {
    DataType.Train: [0, 1, 2, 3],
    DataType.Test: [4, 5, 6, 7]
}
fixed_folding = FixedFolding.from_parts(parts)

The absence of folding at all could be declared as follows:

no_folding = NoFolding(doc_ids=[10, 15, 20], supported_data_type=DataType.Dev)

We also consider a combination of the different foldings by providing a UnitedFolding type. For a particular data type, supported by at least one folding parameter, it gathers the related set of documents and unify of all the documents behind every provided folding. UnitedFolding type could be initialized as follows:

united_folding = UnitedFolding([fixed_folding, no_folding])

The last folding supported by AREkit is a so-called k-fold Cross-Validational one. This folding assumes to distribute whole set of documents among k parts. Algorithm, which describes this distribution is a part of the so called Splitters.

AREkit provides two type of splitters out of the box. The first one (simple version) is consider a random separation by a given seed value. Splitter of this type could be initialized as follows:

splitter_simple = SimpleCrossValidationSplitter(shuffle=True, seed=1)

Another type is a statistical one, in which we rely on a certain measurements, i.e. statistics caclulated by a given document, in order to then consider this statistics for a balanced distribution. By default we provide a sentence-based statistics generation, which calculates an amount of sentences of a given document.

NOTE in present AREkit version we consider that this statistics is provided via file (stat.txt according to the snippet below)

As for doc_ops parameter, this parameter is related to DocumentOperation type, which we cover in the AREkit Tutorial: Craft your text-opinion annotation pipeline!.

doc_ops = FooDocumentOperations()
splitter_statistical = StatBasedCrossValidationSplitter(
    docs_stat=SentenceBasedDocumentStatGenerator(lambda doc_id: doc_ops.get_doc(doc_id)),
    docs_stat_filepath_func=lambda: "data/stat.txt")

Once any of the splitter types declared, it is possible to initialize Cross-Validational-based folding. AREkit provides a two class folding type, which could be adopted as follows:

cv_folding = TwoClassCVFolding(supported_data_types=[DataType.Train, DataType.Test],
                               doc_ids_to_fold=list(range(10)),
                               cv_count=3,
                               splitter=splitter_simple)

Display Data-Type Foldings

For all the folding types mentioned above, AREkit provides fold_to_doc_ids_set method which returns dictionary of foldings. The snippet below illustrates on how the folding could be obtained:

def show_folding(folding):
    assert(isinstance(folding, BaseDataFolding))
    split_dict = folding.fold_doc_ids_set()
    for data_type, doc_ids in split_dict.items():
        print(data_type, doc_ids)

In case of the k-fold Cross Validation there is an additional iter_states method which allows us to iterate over a different k states. Combining this call with a fuction declared above, all the possible data type foldings might be obtained as presented below:

for state_index, _ in enumerate(cv_folding.iter_states()):
    print("State: ", state_index)
    show_folding(cv_folding)         

Conclusion

Tutorial code

Thank you for reading this post! Now you’re able to choose and declare your folding in order to then pass it for samples generation. However, the latter is a part of another topic. Stay well and connected for updates.