File-Association-based Aggregate

Question

I am very new to this kind of work so bear with me please :) I am trying to calculate means over ranges of patterns. E.g. I have two files which are tab delimited:

The file coverage.txt contains two columns. The first column indicates the position and the second the value assigned to that position. There are ca. 4*10^6 positions.

coverage.txt:

1  10

 

2  30

 

3  5

 

4  10

 

The second file patterns.txt contains three columns 1. The name of the pattern; 2. The starting position of the pattern; and 3. End position of the pattern. The pattern ranges do not overlap. There are ca. 3000 patterns.

patterns.txt:

rpoB 1  2

 

gyrA 3  4

 

Now I want to calculate the mean of the values assigned to the positions of the different patterns and write the output to a new file containing the first column of the patterns.txt as an identifier.

Expected output.txt:

rpoB 20

 

gyrA 7.5

 

I think this can be accomplished using Awk but I do not know where to start. Your help would be greatly appreciated!

A solution:

With four million positions, it might be time to reach for a more substantial programming language than Shell/Awk, but you can do it in a single pass with something like this:

awk '{

 

 if (FILENAME ~ "patterns.txt") {

 

 min[$1]=$2

 

 max[$1]=$3

 

 } else {

 

 for (pat in min) {

 

 if ($1 >= min[pat] && $1 <= max[pat]) {

 

 total[pat] += $2

 

 count[pat] += 1

 

 }

 

 }

 

 }

 

}

 

END {

 

 for (pat in total) {

 

 print pat,total[pat]count[pat]

 

 }

 

}' patterns.txt coverage.txt

 

This omits any patterns that don’t have any data in the coverage file; you can change the loop in the END to loop over everything in the patterns file instead and just output 0s for the ones that didn’t show up.

 

Answer

You can also achieve your goal using esProc SPL. The Structured Process Language generates concise and efficient code.

SPL script:

 

A

1

=file("coverage.txt").import()

2

=file("patterns.txt").import()

3

=A2.new(#1,A1.select(#1>=A2.#2   && #1<=A2.#3).avg(#2))

4

=file("coverage.txt").export(A3)

esProc offers JDBC interface to enable the integration of an SPL script into a Java program. It’s easy to operation. For more details, see How to Call an SPL Script in Java.