VEP 蛇形包装器存在的问题
我在尝试运行 snakemake 的
VEP
包装器时遇到了两个问题。
第一个问题是,我想在
calls
中使用
lambda 通配符
,如下所示:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="variants.annotated.vcf",
stats="variants.html"
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
但是,我收到一个错误:
当我将
lambda 通配符
替换为
calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
时([这并不理想 - 请参阅此帖子了解原因][1]),它给出了有关
resources
的错误文件夹?
(snakemake) [moldach@cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
我也 [对文档感到困惑,不知道它如何知道应该指定哪个参考基因组(版本等)][2]。
更新:
由于字符限制,我甚至无法回复两位受访者,因此我将在这里继续 问题 :
正如@jafors 提到的,两个包装器解决了
缓存
和
插件
的问题 - 谢谢!
现在,我尝试运行 VEP 时收到以下规则的错误:
rule variant_annotation:
input:
calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
这是我从日志中得到的错误:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
要清楚 :
这是我在尝试包装器之前运行 VEP 的代码,因此我想保留类似的选项( 例如 离线, 等等 ):
vep \
-i {input.sample} \
--species "caenorhabditis_elegans" \
--format "vcf" \
--everything \
--cache_version 100 \
--offline \
--force_overwrite \
--fasta {input.ref} \
--gff {input.annot} \
--tab \
--variant_class \
--regulatory \
--show_ref_allele \
--numbers \
--symbol \
--protein \
-o {params.sample}
更新 2:
是的,使用
expand()
是个问题。我记得这就是为什么我喜欢使用
lambda
或
os.path.join()
作为规则
input/output
,除了您在
rule all
中提到的情况之外:
下面的方法似乎可以解决该问题,但我遇到了一个新问题:
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
不确定为什么我会收到
unknown file type
错误 - 正如我提到的那样,这是首先使用具有相同输入数据的完整命令进行测试的?
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
更新 3:
bcftools view
发出来自
samtools mpileup
/
varscan pillup2snp
输出的警告:
def getDeduppedBamsIndex(sample):
return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
rule vcf_merge:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
resources:
mem = 1000,
time = 10
threads: 1
message: """--- Merge VarScan by Chromosome."""
shell: """
awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
"""
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))
rule annotate_variants:
input:
calls=lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="{sample}.annotated.vcf",
stats="{sample}.html"
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything" # optional: extra arguments
log:
"logs/vep/{sample}.log"
threads: 4
resources:
time=30,
mem=5000
wrapper:
"0.65.0/bio/vep/annotate"
如果我运行
bcftools在输出中我收到错误:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type
-
关于使用 expand 还是通配符,这完全无关紧要。biostar 帖子只是建议如何保持内容的可读性。在 snakemake/programmatic 方面,只要输入正确,定义输入的方式就无关紧要。
-
关于资源的投诉是,您在规则
variant_annotation
的输入中定义resources/vep/cache
和resources/vep/plugins
是运行variant_annotation
所必需的输入。出现此错误时,snakemake 实际上是在告诉您这些文件不存在,因此它无法为您运行规则。 -
当我查看文档中的代码时,似乎缓存目录作为输入应该定义您使用的基因组:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
除了 Maarten 所说的内容(
resources/vep/cache
和
resources/vep/plugins
只是所需输入的示例路径,还定义了您要使用的基因组和版本),您可以使用以下包装器通过 Snakefile 中的另外两个简单规则轻松获取缓存和插件目录:
- https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.htm
- https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
编辑
很高兴这解决了您的第一个问题。
第二个错误似乎来自输出中的
expand
。
我是否理解正确,您想逐个注释所有 vcf?因此输入是
{sample}.vcf
,输出将是
{sample}.annotated.vcf
?
如果是这种情况,您可能不想在此规则中使用
expand
。
我也不确定,为什么您需要
{ANNOT_DIR
和
{ANNOT_TOOL
作为通配符。我猜如果你使用 VEP,
ANNOT_TOOL
会一直是
VEP
,而
ANNOT_DIR
会是
ANNOTATION
?
然后,你可以直接在输出中将它们写为
ANNOTATION/VEP/{sample}.annotated.vcf
。
对于
{CALLING_DIR
也是一样,我猜这会一直是同一个目录,对吧?我知道如果您在样本上使用多个调用者,则
{CALLING_TOOL} 可能有多个值。
如果我没有误解,那么在使用 VEP 时,您可能想要扩展两个通配符,即
{sample} 和
{CALLING_TOOL}。
只需写入
input:
calls: 'CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
cache="resources/vep/cache",
plugins="resources/vep/plugins"
output:
calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
expand
属于您的规则
all
或任何其他同时使用所有带注释的 vcf 的目标规则,等等。像这样:
rule all:
input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
然后,
variant_annotation
规则将运行您在规则
all
中扩展的所有样本。
我希望我正确理解了您的想法,这会有所帮助。
EDIT2
好的,似乎我们快完成了。您收到的错误是由
bcftools view
引发的 - 它表示 vcf 可能有问题。
您是否在 Snakefile 之外尝试过
bcftools view
与您的 vcf?这将让我们知道问题是在此规则期间出现的,还是 vcf 已经存在某种问题。