Config File
The config file is a YAML file that contains the configuration. You can use one config file per MPRA design. It is divided into reference (reference sequence), datasets (different input files to design), oligo_length (length of the designed oligos), tiling (tiling strategies of predefined regions with variants), and oligo_design (filtering and adapters). A full example file with all possible configurations is available at config/example_config.yml:
---
reference:
  genome: /data/cephfs-1/work/projects/cubit/current/reference/hg38/ucsc/hg38.fa.genome
  fasta: /data/cephfs-1/work/projects/cubit/current/reference/hg38/ucsc/hg38.fa

datasets:
  variants_only:
    - design.control_variants.tsv
  variants_regions:
    - design.samples_combined.tsv
    - design.control_variants_regions.tsv
  regions_only:
    - design.control_regions.tsv
  sequences_only:
    - design.control_sequences.tsv

oligo_length: 270

tiling:
  remove_edge_variants: true
  min_overlap: 50
  strategies:
    centering:
      max: 270
    two_tiles:
      max: 350 # or 2*270 - 2*50 - 2*25
      include_variant_edge: true
  variant_edge_exclusion: 25
oligo_design:
  variants:
    use_most_centered_region: false
    remove_unused_regions: false
  filtering:
    max_homopolymer_length: 10
    max_simple_repeat_fraction: 0.25
  adapters:
    left: AGGACCGGATCAACT
    right: CATTGCGTGAACCGA
Note that the config file is controlled by a JSON schema, so it is validated against that schema. If the config file is not valid, the program exits with an error message. The schema is located in workflow/schemas/config.schema.yaml.
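To illustrate what schema validation checks, here is a minimal sketch in plain Python. It is not the validator the workflow uses. The function `check` and the `SCHEMA` dictionary are hypothetical, and the schema subset is transcribed by hand from the excerpts in this document. It only handles the keywords that appear in those excerpts.

```python
# Simplified stand-in for a JSON Schema validator. It only understands the
# keywords used in the schema excerpts: type, minimum, maximum, required,
# properties, and items.
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so it is excluded explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def check(value, schema, path="config"):
    """Return a list of violations; an empty list means the value is valid."""
    expected = schema.get("type")
    if expected and not TYPE_CHECKS[expected](value):
        return [f"{path}: expected {expected}"]
    errors = []
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: {value} is below minimum {schema['minimum']}")
    if "maximum" in schema and value > schema["maximum"]:
        errors.append(f"{path}: {value} is above maximum {schema['maximum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required key is missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(check(value[key], sub_schema, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(check(item, schema["items"], f"{path}[{i}]"))
    return errors


# Hand-transcribed subset of the schema (assumption: not the complete file).
SCHEMA = {
    "type": "object",
    "properties": {
        "oligo_length": {"type": "integer", "minimum": 1},
        "oligo_design": {
            "type": "object",
            "properties": {
                "filtering": {
                    "type": "object",
                    "properties": {
                        "max_homopolymer_length": {"type": "integer", "minimum": 1},
                        "max_simple_repeat_fraction": {
                            "type": "number", "minimum": 0.0, "maximum": 1.0,
                        },
                    },
                },
            },
        },
    },
}

good = {"oligo_length": 270,
        "oligo_design": {"filtering": {"max_homopolymer_length": 10,
                                       "max_simple_repeat_fraction": 0.25}}}
bad = {"oligo_length": 0,
       "oligo_design": {"filtering": {"max_simple_repeat_fraction": 1.5}}}

print(check(good, SCHEMA))
for problem in check(bad, SCHEMA):
    print(problem)
```

The valid config yields an empty list, while the invalid one reports one violation per offending value, which mirrors the kind of error message a schema validator produces.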
Reference settings
The reference settings are located in the reference section. The following settings are possible:
reference:
  type: object
  description: reference genome
  properties:
    genome:
      type: string
      description: Path to genome file
    fasta:
      type: string
      description: Path to fasta file
- genome:
Genome file with lengths of contigs. The full or relative path to the file should be used.
- fasta:
Genome fasta file (indexed with samtools faidx). The full or relative path to the file should be used.
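The genome file is described above only as a file with contig lengths. Assuming it uses the common two-column, tab-separated layout (contig name, then length), which this document does not state explicitly, it could be derived from the `.fai` index that `samtools faidx` writes next to the FASTA file. The helper `fai_to_genome` below is a hypothetical illustration, not part of the workflow.

```python
def fai_to_genome(fai_text: str) -> str:
    """Keep the first two columns (contig name, length) of a .fai index."""
    rows = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        rows.append(f"{name}\t{int(length)}")
    return "\n".join(rows) + "\n"


# Toy index text for illustration only; the contigs and values are made up.
example_fai = "chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n"
print(fai_to_genome(example_fai), end="")
```

In practice the function would be fed the contents of the real index file (for example `hg38.fa.fai`) and the result written to the path given under `genome`.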
Datasets settings
The input datasets are configured in the datasets section. The following settings are possible:
datasets:
  type: object
  description: datasets to be processed
  properties:
    variants_only:
      description: "List of tsv files with variants and samples to be processed"
      type: array
      items:
        type: string
    regions_only:
      description: "List of tsv files with regions and samples to be processed"
      type: array
      items:
        type: string
    variants_regions:
      description: "List of tsv files with variants, regions and samples to be processed"
      type: array
      items:
        type: string
    sequences_only:
      description: "List of tsv files with sequences to be processed"
      type: array
      items:
        type: string
Each part now contains TSV files that list the samples to include and where their files are located.
- variants_only:
Samples with only variants (vcf file). Will center the region around the variant.
- regions_only:
Samples with only regions (bed file). Region length must have the exact size of the oligo design.
- variants_regions:
Samples with variants and regions (vcf and bed file). The tiling strategy is applied here (see below).
- sequences_only:
Only sequences (fasta file). Sequences must have the exact size of the oligo design.
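Because regions_only inputs must match the oligo length exactly, it can help to check a BED file before running the design. The sketch below assumes standard BED coordinates (0-based start, exclusive end, so length is end minus start). The function name `regions_with_wrong_length` is hypothetical, and this is not necessarily how the workflow itself performs the check.

```python
def regions_with_wrong_length(bed_text: str, oligo_length: int) -> list[str]:
    """Return regions (as chrom:start-end) whose length differs from oligo_length."""
    bad = []
    for line in bed_text.splitlines():
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        if int(end) - int(start) != oligo_length:
            bad.append(f"{chrom}:{start}-{end}")
    return bad


# Toy BED content: the first region is 270 bp long, the second is 200 bp.
example_bed = "chr1\t100\t370\tregion_a\nchr1\t500\t700\tregion_b\n"
print(regions_with_wrong_length(example_bed, oligo_length=270))
```

The same idea applies to sequences_only inputs, where the length of each FASTA sequence would be compared against `oligo_length`.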
Oligo length
The length of the designed oligos is configured with the top-level oligo_length setting. The following settings are possible:
oligo_length:
  type: integer
  description: "Length of oligos to be designed"
  minimum: 1
- oligo_length:
Length of the oligo (excluding adapters).
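Since oligo_length excludes the adapters, the full length of the final construct is larger. A minimal sketch, assuming the two adapters are simply attached on either side of the designed oligo (the function `synthesized_length` is hypothetical):

```python
def synthesized_length(oligo_length: int, left_adapter: str, right_adapter: str) -> int:
    """Total length of left adapter + designed oligo + right adapter."""
    return len(left_adapter) + oligo_length + len(right_adapter)


# Values taken from the example config: 15 + 270 + 15 bases.
total = synthesized_length(270, "AGGACCGGATCAACT", "CATTGCGTGAACCGA")
print(total)  # 300
```

This total is the number to compare against the length limit of the oligo synthesis order.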
Tiling strategy
The tiling strategy is configured in the tiling section. The following settings are possible:
tiling:
  type: object
  description: "Parameters for tiling strategy"
  properties:
    remove_edge_variants:
      type: boolean
      description: "Whether to remove variants that are too close to the edges of regions"
    variant_edge_exclusion:
      type: integer
      description: "Number of bases to exclude variants on oligo edges"
      minimum: 0
      default: 0
    min_overlap:
      type: integer
      description: "Minimum overlap between adjacent oligos"
      minimum: 0
    strategies:
      type: object
      description: "Tiling strategies to be used"
      properties:
        centering:
          type: object
          description: "Parameters for centering strategy"
          properties:
            max:
              type: integer
              description: "Maximum length of the region for design used for centering of the oligo. If region is longer two tiles are used."
              minimum: 1
          required:
            - max
        two_tiles:
          type: object
          description: "Parameters for two_tiles strategy"
          properties:
            max:
              type: integer
              description: "Maximum length of the region for design used for two_tiles strategy. If region is longer than than this value a multiple tiling is used using the min_overlap."
              minimum: 1
            include_variant_edge:
              type: boolean
              description: "Whether to include variant edge in tiling"
          required:
            - max
            - include_variant_edge
      required:
        - centering
        - two_tiles
  required:
    - remove_edge_variants
    - min_overlap
    - strategies
    - variant_edge_exclusion
- remove_edge_variants:
If set to true, variants that are too close to the edge of an oligo will be removed.
- variant_edge_exclusion:
Number of bases to exclude variants on edges of oligos when remove_edge_variants is true.
- min_overlap:
Minimum overlap between adjacent oligos (as described in the schema).
- strategies:
Different tiling strategies to use. See below for details.
  - centering:
  Center the variant in the oligo. max defines the maximum length of a region where an oligo should be centered.
  - two_tiles:
  Use two tiles per variant. max defines the maximum length of a region where two tiles should be used.
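The way the two max values select a strategy can be sketched as follows. This is an interpretation of the schema descriptions (regions up to centering.max are centered, regions up to two_tiles.max get two tiles, longer regions get multiple tiles using min_overlap), not the workflow's actual code. The function names, the label "multiple_tiles", the inclusive boundaries, and the 0-based offset convention are all assumptions.

```python
def choose_strategy(region_length: int, centering_max: int, two_tiles_max: int) -> str:
    """Pick a tiling strategy from the region length (boundaries assumed inclusive)."""
    if region_length <= centering_max:
        return "centering"
    if region_length <= two_tiles_max:
        return "two_tiles"
    return "multiple_tiles"  # hypothetical label for tiling with min_overlap


def is_edge_variant(variant_offset: int, oligo_length: int, exclusion: int) -> bool:
    """True if a 0-based offset lies within `exclusion` bases of either oligo end."""
    return variant_offset < exclusion or variant_offset >= oligo_length - exclusion


# Values from the example config.
oligo_length, min_overlap, variant_edge_exclusion = 270, 50, 25

for length in (200, 300, 400):
    print(length, choose_strategy(length, centering_max=270, two_tiles_max=350))

# The inline comment in the example config offers this formula as an
# alternative to max: 350 for two_tiles.
alternative_max = 2 * oligo_length - 2 * min_overlap - 2 * variant_edge_exclusion
print(alternative_max)  # 390

print(is_edge_variant(10, oligo_length, variant_edge_exclusion))   # True
print(is_edge_variant(135, oligo_length, variant_edge_exclusion))  # False
```

With the example values, a variant at offset 10 would be dropped when remove_edge_variants is true, while a centered variant at offset 135 would be kept.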
Oligo design
The final oligo design (filtering etc.) is configured in the oligo_design section. The following settings are possible:
oligo_design:
  type: object
  description: "Parameters for oligo design"
  properties:
    variants:
      type: object
      description: "Parameters for variant handling during oligo design"
      properties:
        use_most_centered_region:
          type: boolean
          description: "If variants fits to multiple oligos use the one where it is most centered. Otherwise all are used."
        remove_unused_regions:
          type: boolean
          description: "Whether to remove regions that are not used for any variant"
    filtering:
      type: object
      description: "Parameters for filtering designed oligos"
      properties:
        max_homopolymer_length:
          type: integer
          description: "Maximum allowed homopolymer length in designed oligos"
          minimum: 1
        max_simple_repeat_fraction:
          type: number
          description: "Maximum allowed fraction of simple repeats in designed oligos"
          minimum: 0.0
          maximum: 1.0
    adapters:
      type: object
      description: "Adapter sequences to be added to designed oligos"
      properties:
        left:
          type: string
          description: "Left adapter sequence"
        right:
          type: string
          description: "Right adapter sequence"
- variants:
Parameters for variant handling during oligo design.
  - use_most_centered_region:
  If a variant fits into multiple oligos, use the one where it is most centered. Otherwise all are used.
  - remove_unused_regions:
  Whether to remove regions that are not used for any variant.
- filtering:
Parameters for filtering of oligos.
  - max_homopolymer_length:
  Maximum allowed homopolymer length in designed oligos.
  - max_simple_repeat_fraction:
  Maximum allowed fraction of simple repeats in designed oligos.
- adapters:
Parameters for adding adapters to oligos.
  - left:
  Forward adapter sequence to add to the oligos.
  - right:
  Reverse adapter sequence to add to the oligos.
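The homopolymer filter and the adapter step can be illustrated with a short sketch. It assumes that an oligo passes when its longest single-base run is at most max_homopolymer_length, and that adapters are attached as given, left before and right after the oligo. Both are assumptions, since the exact comparison and adapter orientation are not specified here. The function names are hypothetical. max_simple_repeat_fraction is left out because this document does not define how simple repeats are detected.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty sequence)."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def passes_homopolymer_filter(seq: str, max_homopolymer_length: int) -> bool:
    """Assumed rule: keep the oligo if its longest run does not exceed the limit."""
    return longest_homopolymer(seq) <= max_homopolymer_length


def add_adapters(seq: str, left: str, right: str) -> str:
    """Assumed rule: plain concatenation of left adapter, oligo, right adapter."""
    return left + seq + right


oligo = "ACGTTTTTAC"  # toy sequence with a run of five T bases
print(longest_homopolymer(oligo))  # 5
print(passes_homopolymer_filter(oligo, max_homopolymer_length=10))  # True
print(add_adapters("ACGT", "AGGACCGGATCAACT", "CATTGCGTGAACCGA"))
```

A sequence with eleven consecutive identical bases would be rejected under the example setting of max_homopolymer_length: 10.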