Models

class fac.ModelToAnalyze(model: Module, model_cl: int, cl_model_clipped: int, thresholds: ndarray)

Model to analyze for frame alignment checks. Contains the module as well as metadata regarding the model.

Parameters:
  • model – The model to analyze. This model is assumed to output a 3-dimensional tensor of shape (N, T, 3), where N is the batch size, T is the sequence length, and 3 is the number of classes. The model is assumed to output logits (they will be softmaxed internally).

  • model_cl – The context length of the model. This is the number of bases on each side of the central base that the model uses for prediction. This is used for padding the input sequences.

  • cl_model_clipped – The amount the model clips from the input sequence. I.e., if the input is of size (N, T, 4), the output will be of size (N, T - cl_model_clipped, 3). For some models, this is the same as model_cl (e.g., SpliceAI-400 has 400 for both), but for others, it is smaller (e.g., SAM-AM requires 5400 nt of context but only clips 400 for efficiency).

  • thresholds – The calibration thresholds for the model. These are chosen such that the model predicts the correct number of positive examples on average in each channel on a validation set of interest. Shape: (2,); there is no threshold for the first channel. These thresholds should be in the range (0, 1), i.e., they apply to softmaxed probabilities.
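
The shape contract described above can be illustrated with a small NumPy sketch. It does not use fac itself: toy_model is a hypothetical stand-in that only mimics the documented shapes, the threshold values are made up, and comparing the softmaxed channels 1 and 2 against the thresholds is an assumption about how such thresholds are typically applied, not a description of fac's internals.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def toy_model(x, cl_model_clipped):
    # Hypothetical stand-in for a real model:
    # maps (N, T, 4) input to (N, T - cl_model_clipped, 3) logits.
    n, t, _ = x.shape
    rng = np.random.default_rng(0)
    return rng.normal(size=(n, t - cl_model_clipped, 3))

cl_model_clipped = 400             # amount clipped from the input length
x = np.zeros((2, 1000, 4))         # (N, T, 4) one-hot-style input
logits = toy_model(x, cl_model_clipped)
probs = softmax(logits)            # probabilities in (0, 1), summing to 1 per position

# Illustrative thresholds of shape (2,): one per non-background channel.
thresholds = np.array([0.5, 0.5])
# Assumption: a position is called positive in channel c when its
# softmaxed probability exceeds that channel's threshold.
calls = probs[..., 1:] > thresholds

print(logits.shape)  # (2, 600, 3)
print(calls.shape)   # (2, 600, 2)
```

Because the thresholds have shape (2,), they broadcast against the last axis of probs[..., 1:], which is why the first channel is simply sliced away before the comparison.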

fac.models.calibration_accuracy_and_thresholds(m, mcl, *, limit=None)

Compute calibration thresholds on the genes in the validation set. This is used internally for testing and can also be called by users, though we recommend calibrating on a larger set of genes.

Parameters:
  • m – The model to compute calibration thresholds for. It is assumed to output a 3-dimensional tensor of shape (N, T, 3), where N is the batch size, T is the sequence length, and 3 is the number of classes. The outputs are assumed to be log probabilities.

  • limit – The number of genes to use for calibration. If None, all genes will be used.

Returns:

thresholds: The calibration thresholds for the model. Will be of shape (2,). These thresholds are such that the model will predict the correct number of positive examples on average in each channel.
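
As a rough illustration of what such a calibration threshold means, the sketch below picks, for each of the two non-background channels, the cutoff at which the number of predicted positives equals the number of true positives in a pooled set of positions. This is not fac's implementation: calibrate_thresholds is a hypothetical helper, the data are random, and matching counts over one pooled set is a simplifying assumption standing in for "on average".

```python
import numpy as np

def calibrate_thresholds(probs, labels):
    """Hypothetical helper, not part of fac.

    probs:  (M, 3) softmaxed probabilities for M positions.
    labels: (M,) integer class labels in {0, 1, 2}.
    Returns an array of shape (2,) with one threshold per channel 1 and 2.
    """
    thresholds = np.empty(2)
    for c in (1, 2):
        k = int((labels == c).sum())          # number of true positives in channel c
        ranked = np.sort(probs[:, c])[::-1]   # probabilities, largest first
        # The k-th largest probability: calling everything >= this value
        # yields exactly k positives (assuming k >= 1 and no ties).
        thresholds[c - 1] = ranked[k - 1]
    return thresholds

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=1000)  # random rows that sum to 1
labels = rng.integers(0, 3, size=1000)        # random labels, for illustration only

thresholds = calibrate_thresholds(probs, labels)
for c in (1, 2):
    predicted = int((probs[:, c] >= thresholds[c - 1]).sum())
    actual = int((labels == c).sum())
    print(c, predicted, actual)
```

Thresholds chosen this way depend on the class balance of the set they were computed on, which is one reason to calibrate on a larger set of genes.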