Integrate descriptions from one file into another XML file in Perl
I'm new here, and I apologize for my bad English. I have two files (file 1: the main XML file, file 2: a description file), and I want to integrate the descriptions line by line at a specific position in the XML file (replacing the xx in hit_def).
File 1: here is the XML tree:
<blastoutput>
  <blastoutput_iterations>
    <iteration> (gene 1)
      <iteration_hits>
        <hit> (1-10)
          <hit_def>
    <iteration> (gene 2)
      <iteration_hits>
        <hit> (1-10)
          <hit_def>
And here are the first and last lines, because the file is 5 GB in size:
<?xmlversion="1.0"?> <blastoutput> <blastoutput_program>rapsearch</blastoutput_program> <blastoutput_version>rapsearch2</blastoutput_version> <blastoutput_reference>yonganzhao,haixutangandyuzhenye.rapsearch2:afastandmemory-efficientproteinsimilaritysearchtoolfornextgenerationsequencingdata.bioinformatics2012,28(1):125-126</blastoutput_reference> <blastoutput_db>/mreferate/dwolff/rapsearch2.23/db/ncbi_nr_dec15</blastoutput_db> <blastoutput_param> <parameters> </parameters> </blastoutput_param> <blastoutput_iterations> <iteration> <iteration_iter-num>1</iteration_iter-num> <iteration_query-def>gene_id_1</iteration_query-def> <iteration_query-len>37</iteration_query-len> <iteration_hits> <hit> <hit_num>1</hit_num> <hit_id>gi|939543432|gb|kpv42113.1|</hit_id> <hit_def>xx</hit_def> <hit_accession>kpv42113.1</hit_accession> <hit_len>162</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>58.151</hsp_bit-score> <hsp_score>139</hsp_score> <hsp_evalue>-5.6061</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>54</hsp_hit-from> <hsp_hit-to>90</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>28</hsp_identity> <hsp_positive>33</hsp_positive> <hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>lvlcdepvsaldvsvqaavlnllleiqrehgttmifi</hsp_hseq> <hsp_midline>+v+cdepvsaldvsvqaavlll+eiq++htmii</hsp_midline> </hsp> </hit_hsps> </hit> <hit> <hit_num>2</hit_num> <hit_id>gi|385280362|gb|eif44286.1|</hit_id> <hit_def>xx</hit_def> <hit_accession>eif44286.1</hit_accession> <hit_len>327</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>54.6842</hsp_bit-score> <hsp_score>130</hsp_score> <hsp_evalue>-4.56249</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>169</hsp_hit-from> <hsp_hit-to>205</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>24</hsp_identity> <hsp_positive>31</hsp_positive> 
<hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>lvicdepvsaldvsvqaqiinllqelqtehntamlfi</hsp_hseq> <hsp_midline>+v+cdepvsaldvsvqa++lle+q+htam+i</hsp_midline> </hsp> </hit_hsps> </hit> <hit> <hit_num>3</hit_num> <hit_id>gi|550913550|ref|wp_022666548.1|</hit_id> <hit_def>xx</hit_def> <hit_accession>wp_022666548.1</hit_accession> <hit_len>721</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>53.5286</hsp_bit-score> <hsp_score>127</hsp_score> <hsp_evalue>-4.21462</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>549</hsp_hit-from> <hsp_hit-to>585</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>27</hsp_identity> <hsp_positive>31</hsp_positive> <hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>mvicdepvsaldvsvqaavlnllneikeemgttmifi</hsp_hseq> <hsp_midline>mv+cdepvsaldvsvqaavlllei+++tmii</hsp_midline> </hsp> </hit_hsps> </hit> ... </iteration_hits> <iteration_stat> <statistics> <statistics_db-num>77704984</statistics_db-num> <statistics_db-len>28292933896</statistics_db-len> <statistics_hsp-len>0</statistics_hsp-len> <statistics_eff-space>0</statistics_eff-space> <statistics_kappa>0.041</statistics_kappa> <statistics_lambda>0.267</statistics_lambda> <statistics_entropy>0.14</statistics_entropy> </statistics> </iteration_stat> </iteration> </blastoutput_iterations> </blastoutput>
File 2:
peptide abc transporter atpase, partial [kouleothrix aurantiaca]
oligopeptide abc transporter [gamma proteobacterium bdw918]
abc transporter atp-binding protein [desulfospira joergensenii]
The output should be:
<?xmlversion="1.0"?> <blastoutput> <blastoutput_program>rapsearch</blastoutput_program> <blastoutput_version>rapsearch2</blastoutput_version> <blastoutput_reference>yonganzhao,haixutangandyuzhenye.rapsearch2:afastandmemory-efficientproteinsimilaritysearchtoolfornextgenerationsequencingdata.bioinformatics2012,28(1):125-126</blastoutput_reference> <blastoutput_db>/mreferate/dwolff/rapsearch2.23/db/ncbi_nr_dec15</blastoutput_db> <blastoutput_param> <parameters> </parameters> </blastoutput_param> <blastoutput_iterations> <iteration> <iteration_iter-num>1</iteration_iter-num> <iteration_query-def>gene_id_1</iteration_query-def> <iteration_query-len>37</iteration_query-len> <iteration_hits> <hit> <hit_num>1</hit_num> <hit_id>gi|939543432|gb|kpv42113.1|</hit_id> <hit_def>peptide abc transporter atpase, partial [kouleothrix aurantiaca]</hit_def> <hit_accession>kpv42113.1</hit_accession> <hit_len>162</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>58.151</hsp_bit-score> <hsp_score>139</hsp_score> <hsp_evalue>-5.6061</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>54</hsp_hit-from> <hsp_hit-to>90</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>28</hsp_identity> <hsp_positive>33</hsp_positive> <hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>lvlcdepvsaldvsvqaavlnllleiqrehgttmifi</hsp_hseq> <hsp_midline>+v+cdepvsaldvsvqaavlll+eiq++htmii</hsp_midline> </hsp> </hit_hsps> </hit> <hit> <hit_num>2</hit_num> <hit_id>gi|385280362|gb|eif44286.1|</hit_id> <hit_def>oligopeptide abc transporter [gamma proteobacterium bdw918]</hit_def> <hit_accession>eif44286.1</hit_accession> <hit_len>327</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>54.6842</hsp_bit-score> <hsp_score>130</hsp_score> <hsp_evalue>-4.56249</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>169</hsp_hit-from> 
<hsp_hit-to>205</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>24</hsp_identity> <hsp_positive>31</hsp_positive> <hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>lvicdepvsaldvsvqaqiinllqelqtehntamlfi</hsp_hseq> <hsp_midline>+v+cdepvsaldvsvqa++lle+q+htam+i</hsp_midline> </hsp> </hit_hsps> </hit> <hit> <hit_num>3</hit_num> <hit_id>gi|550913550|ref|wp_022666548.1|</hit_id> <hit_def>abc transporter atp-binding protein [desulfospira joergensenii]</hit_def> <hit_accession>wp_022666548.1</hit_accession> <hit_len>721</hit_len> <hit_hsps> <hsp> <hsp_num>1</hsp_num> <hsp_bit-score>53.5286</hsp_bit-score> <hsp_score>127</hsp_score> <hsp_evalue>-4.21462</hsp_evalue> <hsp_query-from>1</hsp_query-from> <hsp_query-to>37</hsp_query-to> <hsp_hit-from>549</hsp_hit-from> <hsp_hit-to>585</hsp_hit-to> <hsp_query-frame>0</hsp_query-frame> <hsp_identity>27</hsp_identity> <hsp_positive>31</hsp_positive> <hsp_align-len>37</hsp_align-len> <hsp_qseq>mvvcdepvsaldvsvqaavltllveiqqqhetamili</hsp_qseq> <hsp_hseq>mvicdepvsaldvsvqaavlnllneikeemgttmifi</hsp_hseq> <hsp_midline>mv+cdepvsaldvsvqaavlllei+++tmii</hsp_midline> </hsp> </hit_hsps> </hit> ... </iteration_hits> <iteration_stat> <statistics> <statistics_db-num>77704984</statistics_db-num> <statistics_db-len>28292933896</statistics_db-len> <statistics_hsp-len>0</statistics_hsp-len> <statistics_eff-space>0</statistics_eff-space> <statistics_kappa>0.041</statistics_kappa> <statistics_lambda>0.267</statistics_lambda> <statistics_entropy>0.14</statistics_entropy> </statistics> </iteration_stat> </iteration> </blastoutput_iterations> </blastoutput>
My first attempts at writing a script gave no results; they were disastrous. I hope you can help me.
I updated the script to match the new XML structure shown above.
Check the comments in the code below:
use strict;
use warnings;
use XML::Simple;

# First, parse the XML into a hash.
# ForceArray keeps 'iteration' and 'hit' as arrays even when only one is present;
# KeyAttr => [] switches off folding of arrays into hashes.
open my $mf1, '<', 'my_xml.xml' or die "Cannot read my_xml.xml: $!";
my $xml = XMLin( $mf1, ForceArray => [ 'iteration', 'hit' ], KeyAttr => [] );
close $mf1;

=pod

Sample of $xml (dumped without ForceArray, so the single 'iteration' appears as a hash):

  $VAR1 = {
    'blastoutput_db' => '/mreferate/dwolff/rapsearch2.23/db/ncbi_nr_dec15',
    'blastoutput_program' => 'rapsearch',
    'blastoutput_param' => { 'parameters' => {} },
    'blastoutput_reference' => 'yonganzhao,haixutangandyuzhenye.rapsearch2:afastandmemory-efficientproteinsimilaritysearchtoolfornextgenerationsequencingdata.bioinformatics2012,28(1):125-126',
    'blastoutput_version' => 'rapsearch2',
    'blastoutput_iterations' => {
      'iteration' => {
        'iteration_hits' => {
          'hit' => [
            {
              'hit_accession' => 'kpv42113.1',
              'hit_id' => 'gi|939543432|gb|kpv42113.1|',
              'hit_hsps' => {
                'hsp' => {
                  'hsp_hseq' => 'lvlcdepvsaldvsvqaavlnllleiqrehgttmifi',
                  'hsp_bit-score' => '58.151',
                  'hsp_identity' => '28',
                  'hsp_align-len' => '37',
                  'hsp_query-frame' => '0',
                  'hsp_query-from' => '1',
                  'hsp_qseq' => 'mvvcdepvsaldvsvqaavltllveiqqqhetamili',
                  'hsp_evalue' => '-5.6061',
                  'hsp_midline' => '+v+cdepvsaldvsvqaavlll+eiq++htmii',
                  'hsp_num' => '1',
                  'hsp_positive' => '33',
                  'hsp_hit-from' => '54',
                  'hsp_score' => '139',
                  'hsp_hit-to' => '90',
                  'hsp_query-to' => '37'
                }
              },
              'hit_len' => '162',
              'hit_num' => '1',
              'hit_def' => 'xx'
            },
            {
              'hit_accession' => 'eif44286.1',
              'hit_id' => 'gi|385280362|gb|eif44286.1|',
              'hit_hsps' => {
                'hsp' => {
                  'hsp_hit-from' => '169',
                  'hsp_positive' => '31',
                  'hsp_score' => '130',
                  'hsp_query-to' => '37',
                  'hsp_hit-to' => '205',
                  'hsp_num' => '1',
                  'hsp_midline' => '+v+cdepvsaldvsvqa++lle+q+htam+i',
                  'hsp_align-len' => '37',
                  'hsp_query-frame' => '0',
                  'hsp_qseq' => 'mvvcdepvsaldvsvqaavltllveiqqqhetamili',
                  'hsp_evalue' => '-4.56249',
                  'hsp_query-from' => '1',
                  'hsp_bit-score' => '54.6842',
                  'hsp_identity' => '24',
                  'hsp_hseq' => 'lvicdepvsaldvsvqaqiinllqelqtehntamlfi'
                }
              },
              'hit_def' => 'xx',
              'hit_len' => '327',
              'hit_num' => '2'
            },

=cut

# Save the second file into an array, one description per element.
open my $mf2, '<', 'file2' or die "Cannot read file2: $!";
chomp( my @defs = <$mf2> );
close $mf2;

# Update the XML hash.
# The description is looked up by hit_num, so line 1 of file2 goes to hit 1,
# line 2 to hit 2, and so on. hit_num restarts at 1 in every iteration, so use a
# running counter instead if file2 holds one line per hit across all iterations.
foreach my $iteration ( @{ $xml->{'blastoutput_iterations'}{'iteration'} } ) {
    foreach my $hit ( @{ $iteration->{'iteration_hits'}{'hit'} } ) {
        $hit->{'hit_def'} = $defs[ $hit->{'hit_num'} - 1 ];
    }
}

# Write the updated structure back to file 1.
# XML::Simple does not preserve the original order of sibling elements.
open my $mf1_new, '>', 'my_xml.xml' or die "Cannot write my_xml.xml: $!";
XMLout( $xml, OutputFile => $mf1_new, NoAttr => 1, RootName => 'blastoutput' );
close $mf1_new;
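Because the question mentions a 5 GB file, parsing the whole document into a hash may not fit in memory. As a rough alternative, here is a minimal streaming sketch that uses only core Perl. It rests on several assumptions that are not confirmed by the question: the real file contains line breaks (so it can be read line by line), every placeholder is written exactly as <hit_def>xx</hit_def>, and file 2 holds one description per hit in document order across all iterations. The names fill_hit_defs and xml_escape are hypothetical helpers, and the three file names are passed on the command line.

```perl
use strict;
use warnings;

# Escape the characters that must not appear raw in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Copy $xml_in to $xml_out, replacing each placeholder <hit_def>xx</hit_def>
# with the next line read from $defs_in. Returns the number of replacements.
# Assumption: placeholders appear in the same order as the lines of $defs_in.
sub fill_hit_defs {
    my ( $xml_in, $defs_in, $xml_out ) = @_;
    my $count = 0;
    while ( my $line = <$xml_in> ) {
        $line =~ s{<hit_def>xx</hit_def>}{
            my $def = <$defs_in>;
            die "description file ran out after $count hits\n" unless defined $def;
            chomp $def;
            $count++;
            '<hit_def>' . xml_escape($def) . '</hit_def>';
        }ge;
        print {$xml_out} $line;
    }
    return $count;
}

# Run only when three file names are given: input XML, descriptions, output XML.
if ( @ARGV == 3 ) {
    my ( $xml_file, $defs_file, $out_file ) = @ARGV;
    open my $xml_in,  '<', $xml_file  or die "Cannot read $xml_file: $!";
    open my $defs_in, '<', $defs_file or die "Cannot read $defs_file: $!";
    open my $xml_out, '>', $out_file  or die "Cannot write $out_file: $!";
    my $n = fill_hit_defs( $xml_in, $defs_in, $xml_out );
    close $xml_out or die "Cannot close $out_file: $!";
    print STDERR "replaced $n hit_def placeholders\n";
}
```

The output goes to a separate file rather than overwriting the input, since the input is still being read while the output is written. Unlike the hash-based version, this keeps the element order and formatting of the original file untouched, at the cost of relying on a textual match instead of a real XML parser.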