Config File

The config file is a YAML file that holds all settings of a design. Use one config file per MPRA design. It is divided into reference (reference sequence), datasets (the different input files for the design), oligo_length (length of the designed oligos), tiling (tiling strategies for predefined regions with variants), and oligo_design (filtering and adapters). A full example with all possible options is shown below; it is available as config/example_config.yml.

---
reference:
  genome: /data/cephfs-1/work/projects/cubit/current/reference/hg38/ucsc/hg38.fa.genome
  fasta: /data/cephfs-1/work/projects/cubit/current/reference/hg38/ucsc/hg38.fa

datasets:
  variants_only:
    - design.control_variants.tsv
  variants_regions:
    - design.samples_combined.tsv
    - design.control_variants_regions.tsv
  regions_only:
    - design.control_regions.tsv
  sequences_only:
    - design.control_sequences.tsv

oligo_length: 270

tiling:
  remove_edge_variants: true
  min_overlap: 50
  strategies:
    centering:
      max: 270
    two_tiles:
      max: 350 # or 2*270 -2 *50 - 2*25
      include_variant_edge: true
  variant_edge_exclusion: 25

oligo_design:
  variants:
    use_most_centered_region: false
    remove_unused_regions: false
  filtering:
    max_homopolymer_length: 10
    max_simple_repeat_fraction: 0.25
  adapters:
    left: AGGACCGGATCAACT
    right: CATTGCGTGAACCGA

Note that the config file is controlled by a JSON schema: the config file is validated against the schema, and if it is not valid, the program exits with an error message. The schema is located in workflow/schemas/config.schema.yaml.
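To give a feeling for what such a validation catches, the following sketch hand-codes one kind of check: the required keys of the tiling section, copied from the schema excerpt shown further down. It is not the workflow's validator; the function name and the dotted key notation are assumptions made for illustration, and the config is assumed to be already parsed into a Python dict.

```python
# Required keys copied from the tiling part of the schema.
# Function name and output format are hypothetical.
REQUIRED_TILING = (
    "remove_edge_variants",
    "min_overlap",
    "strategies",
    "variant_edge_exclusion",
)
REQUIRED_STRATEGIES = ("centering", "two_tiles")


def missing_tiling_keys(config: dict) -> list[str]:
    """Return dotted names of required tiling keys absent from a parsed config."""
    tiling = config.get("tiling", {})
    missing = [f"tiling.{key}" for key in REQUIRED_TILING if key not in tiling]
    strategies = tiling.get("strategies", {})
    missing += [
        f"tiling.strategies.{key}"
        for key in REQUIRED_STRATEGIES
        if key not in strategies
    ]
    return missing


# A config that forgot variant_edge_exclusion and the two_tiles strategy.
config = {
    "tiling": {
        "remove_edge_variants": True,
        "min_overlap": 50,
        "strategies": {"centering": {"max": 270}},
    }
}
print(missing_tiling_keys(config))
```

A real schema validation additionally checks types and value ranges (for example minimum: 1 for oligo_length), which this sketch leaves out.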

Reference settings

The reference settings are located in the reference section. The following settings are possible:

  reference:
    type: object
    description: reference genome
    properties:
      genome:
        type: string
        description: Path to genome file
      fasta:
        type: string
        description: Path to fasta file
genome:

Genome file with lengths of contigs. The full or relative path to the file should be used.

fasta:

Genome fasta file (indexed with samtools faidx). The full or relative path to the file should be used.
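If only the FASTA file is at hand, a contig-length file can usually be derived from the samtools faidx index, whose first two columns are the contig name and its length. The sketch below assumes that the genome file is a two-column, tab-separated table (name, length); the exact format expected by the workflow is not stated here, and the function name and the example index lines are illustrative.

```python
def fai_to_genome(fai_text: str) -> str:
    """Keep the first two columns (contig name, length) of a .fai index."""
    lines = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, length = line.split("\t")[:2]
        lines.append(f"{name}\t{int(length)}")
    return "\n".join(lines) + "\n"


# Illustrative .fai content: name, length, offset, bases per line, bytes per line.
example_fai = "chr1\t248956422\t6\t50\t51\nchr2\t242193529\t253935563\t50\t51\n"
print(fai_to_genome(example_fai), end="")
```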

Datasets settings

The input files of the design are configured in the datasets section. The following settings are possible:

  datasets:
    type: object
    description: datasets to be processed
    properties:
      variants_only:
        description: "List of tsv files with variants and samples  to be processed"
        type: array
        items:
          type: string
      regions_only:
        description: "List of tsv files with regions and samples to be processed"
        type: array
        items:
          type: string
      variants_regions:
        description: "List of tsv files with variants, regions and samples  to be processed"
        type: array
        items:
          type: string
      sequences_only:
        description: "List of tsv files with sequences to be processed"
        type: array
        items:
          type: string

Each part contains a list of TSV files. Each TSV file lists the samples to include and where their files are located.

variants_only:

Samples with variants only (VCF file). The designed region is centered around each variant.
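Centering can be pictured with a small coordinate calculation. This is a sketch under assumptions, not the workflow's code: positions are 0-based, the window is half-open, and the handling of even oligo lengths, indels, and contig ends is not specified in this document. The function name is hypothetical.

```python
def centered_window(variant_pos: int, oligo_length: int) -> tuple[int, int]:
    """Return a 0-based, half-open (start, end) window around a variant.

    variant_pos is the 0-based position of the variant's first base.
    """
    start = variant_pos - oligo_length // 2
    return start, start + oligo_length


start, end = centered_window(1000, 270)
print(start, end)  # window of 270 bases with the variant at offset 135
```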

regions_only:

Samples with regions only (BED file). Each region must have exactly the length of the designed oligo (oligo_length).

variants_regions:

Samples with variants and regions (VCF and BED file). The tiling strategy (see below) is applied to these samples.

sequences_only:

Sequences only (FASTA file). Each sequence must have exactly the length of the designed oligo (oligo_length).
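Because regions_only and sequences_only inputs must match the oligo length exactly, it can help to check the inputs before starting a design. The following sketch does this for BED-style regions, assuming the usual 0-based, half-open BED coordinates; the function name and the example regions are made up for illustration.

```python
def wrong_sized_regions(regions, oligo_length):
    """Return names of regions whose length differs from oligo_length.

    regions: iterable of (name, start, end) tuples in BED coordinates
    (0-based, half-open), so the length is end - start.
    """
    return [name for name, start, end in regions if end - start != oligo_length]


regions = [("ctrl_1", 1000, 1270), ("ctrl_2", 5000, 5269)]
print(wrong_sized_regions(regions, 270))  # ctrl_2 is only 269 bases long
```

The same comparison applies to sequences_only inputs, with len(sequence) in place of end - start.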

Oligo length

The length of the designed oligos is configured with the oligo_length setting. The following settings are possible:

  oligo_length:
    type: integer
    description: "Length of oligos to be designed"
    minimum: 1
oligo_length:

Length of the oligo (excluding adapters).

Tiling strategy

The tiling strategy is configured in the tiling section. The following settings are possible:

  tiling:
    type: object
    description: "Parameters for tiling strategy"
    properties:
      remove_edge_variants:
        type: boolean
        description: "Whether to remove variants that are too close to the edges of regions"
      variant_edge_exclusion:
        type: integer
        description: "Number of bases to exclude variants on oligo edges"
        minimum: 0
        default: 0
      min_overlap:
        type: integer
        description: "Minimum overlap between adjacent oligos"
        minimum: 0
      strategies:
        type: object
        description: "Tiling strategies to be used"
        properties:
          centering:
            type: object
            description: "Parameters for centering strategy"
            properties:
              max:
                type: integer
                description: "Maximum length of the region for design used for centering of the oligo. If region is longer two tiles are used."
                minimum: 1
            required:
              - max
          two_tiles:
            type: object
            description: "Parameters for two_tiles strategy"
            properties:
              max:
                type: integer
                description: "Maximum length of the region for design used for two_tiles strategy. If region is longer than than this value a multiple tiling is used using the min_overlap."
                minimum: 1
              include_variant_edge:
                type: boolean
                description: "Whether to include variant edge in tiling"
            required:
              - max
              - include_variant_edge
        required:
          - centering
          - two_tiles
    required:
      - remove_edge_variants
      - min_overlap
      - strategies
      - variant_edge_exclusion
remove_edge_variants:

If set to true, variants that are too close to the edge of an oligo will be removed.

variant_edge_exclusion:

Number of bases to exclude variants on edges of oligos when remove_edge_variants is true.
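The interplay of the two options can be sketched as a simple position test. The boundary convention used here (0-based positions, half-open oligo interval, a variant exactly variant_edge_exclusion bases away from the edge is kept) is an assumption, as is the function name.

```python
def is_edge_variant(
    variant_pos: int, oligo_start: int, oligo_end: int, edge_exclusion: int
) -> bool:
    """True if the variant lies within edge_exclusion bases of either oligo end.

    The oligo is the half-open interval [oligo_start, oligo_end).
    """
    return (
        variant_pos < oligo_start + edge_exclusion
        or variant_pos >= oligo_end - edge_exclusion
    )


# With a 270 base oligo and variant_edge_exclusion: 25,
# the first and last 25 positions count as edge.
for pos in (24, 25, 244, 245):
    print(pos, is_edge_variant(pos, 0, 270, 25))
```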

min_overlap:

Minimum overlap, in bases, between adjacent oligos when a region is tiled.

strategies:

Different tiling strategies to use. See below for details.

centering:

Center the variant in the oligo. max defines the maximum length of a region for which a single centered oligo is designed. If the region is longer, two tiles are used.

two_tiles:

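Use two tiles per variant. max defines the maximum length of a region for which two tiles are used. If the region is longer than this value, multiple tiles are used, based on min_overlap. include_variant_edge defines whether the variant edge is included in the tiling.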
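Following the schema descriptions, the strategy depends on the region length. A minimal sketch of that decision is given below; the function name and the label "multiple_tiles" are hypothetical, and treating a region of exactly max bases as still belonging to the smaller strategy is an assumption.

```python
def choose_strategy(region_length: int, centering_max: int, two_tiles_max: int) -> str:
    """Pick a tiling strategy from the region length and the two max settings."""
    if region_length <= centering_max:
        return "centering"
    if region_length <= two_tiles_max:
        return "two_tiles"
    # Longer regions are tiled with several oligos using min_overlap.
    return "multiple_tiles"


# Values from the example config: centering.max 270, two_tiles.max 350.
for length in (200, 300, 500):
    print(length, choose_strategy(length, 270, 350))
```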

Oligo design

The final oligo design (filtering etc.) is configured in the oligo_design section. The following settings are possible:

  oligo_design:
    type: object
    description: "Parameters for oligo design"
    properties:
      variants:
        type: object
        description: "Parameters for variant handling during oligo design"
        properties:
          use_most_centered_region:
            type: boolean
            description: "If variants fits to multiple oligos use the one where it is most centered. Otherwise all are used."
          remove_unused_regions:
            type: boolean
            description: "Whether to remove regions that are not used for any variant"
      filtering:
        type: object
        description: "Parameters for filtering designed oligos"
        properties:
          max_homopolymer_length:
            type: integer
            description: "Maximum allowed homopolymer length in designed oligos"
            minimum: 1
          max_simple_repeat_fraction:
            type: number
            description: "Maximum allowed fraction of simple repeats in designed oligos"
            minimum: 0.0
            maximum: 1.0
      adapters:
        type: object
        description: "Adapter sequences to be added to designed oligos"
        properties:
          left:
            type: string
            description: "Left adapter sequence"
          right:
            type: string
            description: "Right adapter sequence"
variants:

Parameters for variant handling during oligo design.

use_most_centered_region:

If a variant fits into multiple oligos, use only the oligo in which it is most centered. Otherwise, all matching oligos are used.
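One way to picture "most centered" is to compare the distance between the variant and the midpoint of each candidate oligo. The sketch below does exactly that; how the workflow measures centering and breaks ties is not described here, so both the distance measure and the function name are assumptions.

```python
def most_centered(variant_pos: int, oligos: list[tuple[int, int]]) -> tuple[int, int]:
    """Return the (start, end) oligo whose midpoint is closest to the variant.

    Oligos are 0-based, half-open intervals; the midpoint of [start, end)
    is (start + end - 1) / 2. On ties the first oligo in the list wins.
    """
    return min(oligos, key=lambda o: abs((o[0] + o[1] - 1) / 2 - variant_pos))


# Two overlapping oligos that both contain the variant at position 200.
oligos = [(0, 270), (80, 350)]
print(most_centered(200, oligos))
```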

remove_unused_regions:

Whether to remove regions that are not used for any variant.

filtering:

Parameters for filtering of oligos.

max_homopolymer_length:

Maximum allowed homopolymer length in designed oligos.
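A homopolymer is a run of identical bases, so the filter can be illustrated by measuring the longest run in a sequence. This is a sketch, not the workflow's implementation; the function names are hypothetical, and it assumes that an oligo passes when its longest run is at most max_homopolymer_length.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases (case-insensitive)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def passes_homopolymer_filter(seq: str, max_homopolymer_length: int) -> bool:
    """Assumed rule: keep the oligo if no run exceeds the configured maximum."""
    return longest_homopolymer(seq) <= max_homopolymer_length


seq = "ACG" + "T" * 11 + "AC"  # contains a run of 11 T
print(longest_homopolymer(seq), passes_homopolymer_filter(seq, 10))
```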

max_simple_repeat_fraction:

Maximum allowed fraction of simple repeats in designed oligos (a value between 0.0 and 1.0).

adapters:

Parameters for adding adapters to oligos.

left:

Left (forward) adapter sequence to add to the oligos.

right:

Right (reverse) adapter sequence to add to the oligos.
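Since oligo_length excludes the adapters, the final sequence is longer than the configured oligo length. The sketch below uses the adapters from the example config and assumes that both adapters are attached exactly as written, the left one in front and the right one behind the oligo; whether the workflow transforms the right adapter in any way is not stated here. The function name and the placeholder oligo are illustrative.

```python
LEFT = "AGGACCGGATCAACT"   # adapters.left from the example config
RIGHT = "CATTGCGTGAACCGA"  # adapters.right from the example config


def add_adapters(oligo: str, left: str, right: str) -> str:
    """Attach the adapters as written: left + oligo + right."""
    return left + oligo + right


oligo = "ACGTAC" * 45  # placeholder sequence of 270 bases (oligo_length: 270)
full = add_adapters(oligo, LEFT, RIGHT)
print(len(oligo), len(full))  # 270 bases plus two 15 base adapters
```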