Column Processing via Sequence Numbers

Question

I’ve done a fair amount of searching through SO and have not found or perhaps have not understood the solution to the following question. I have the following example data frame:

df <- data.frame(Day = c('Day 1', 'Day 2', 'Day 3', 'Day 1', 'Day 2', 'Day 3'),

 

 WBC.esc = c('Yes', 'Yes', 'Yes', 'No', 'No', 'No'),

 

 WBC.deesc = c('No', 'No', 'No', 'No', 'Yes', 'Yes'),

 

 WBC.cdm = c('Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'),

 

 Hgb.esc = c('Yes', 'No', 'Yes', 'No', 'Yes', 'No'),

 

 Hgb.deesc = c('No', 'Yes', 'No', 'No', 'No', 'Yes'),

 

 Hgb.cdm = c('No', 'No', 'No', 'No', 'No', 'No'),

 

 Plt.esc = c('No', 'No', 'Yes', 'No', 'No', 'No'),

 

 Plt.deesc = c('Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'),

 

 Plt.cdm = c('Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'))

 

# Day WBC.esc WBC.deesc WBC.cdm Hgb.esc Hgb.deesc Hgb.cdm Plt.esc Plt.deesc Plt.cdm

 

\# 1 Day 1 Yes No Yes Yes No No No Yes Yes

 

\# 2 Day 2 Yes No Yes No Yes No No Yes Yes

 

\# 3 Day 3 Yes No Yes Yes No No Yes No Yes

 

\# 4 Day 1 No No Yes No No No No No Yes

 

\# 5 Day 2 No Yes Yes Yes No No No Yes Yes

 

\# 6 Day 3 No Yes Yes No Yes No No Yes Yes

 

Days represent the number of days a patient has been in the hospital. The other column names refer to types of orders. I would like to answer the question of whether there were orders made (“Yes”) in any of the WBC/Hgb/Plt.esc or WBC/Hgb/Plt.deesc categories. In the large dataset there are other prefixes before *.esc such as HeartRate.esc, BP.esc so just searching on the .esc suffix won’t work. I know how to do this inefficiently by specifying each column name using the following code:

 

sum(df$Day=="Day 1" & (df$WBC.esc=="Yes" | df$WBC.deesc=="Yes" | df$Hgb.esc=="Yes" | df$Hgb.deesc=="Yes" | df$Plt.esc=="Yes" | df$Plt.deesc=="Yes"))

 

# [1] 1

This gives the correct answer of 1.

I have referenced the columns by their numbers which saves me from using the full column name (ex: dt[,2] instead of dt$WBC.esc)but even this seems prohibitively tedious when in my actual data set there are sometimes 40-50 columns to reference. In the full data set I would like to be able to do something like the following (using Excel terminology here since that’s what I’m familiar with, with alphabets being column names). The idea is that it will take a row, look through all the specified columns in that row for a “Yes” value and if it finds it, count it as a 1 or if it does not find “Yes” then count it as a 0.

countif(B:E=="Yes" | H:Y=="Yes" | AE:AG=="Yes")

 

I also tried this:

sum(df$Day=="Day 1" & (dt\[,2:3\]=="Yes" | df\[,5:6\]=="Yes" | df\[,8:9\]=="Yes"),na.rm=TRUE)

 

This does not work as it returns an answer of “2” suggesting that it is counting each “Yes.” Also this function does not run unless the number of columns reference in each is uneven such as dt[,2:4] | dt[,8:9] in which case I get:

“Error: binary operation on non-conformable arrays”

I’ve searched through a number of questions but in the posts I’ve seen the column names are individually listed out, not referred to in aggregate. How can I perform this operation referencing multiple sets of adjacent columns without having to list each variable separately?

 

Answer

You can achieve what you want by entering a sequence of column numbers, like 2,3,5,6,8,9, and then modify the length and member values of the sequence to change the query condition. This algorithm requires referencing column numbers dynamically. SPL manipulates this type of structured computation effortlessly:

A

1

=file("D://data.csv").import@t()

2

=[2,3,5,6,8,9]

3

=A2.("#"/~/"==\"Yes\"").concat("||")

4

=A1.count(Day=="Day1"    &&(${A3}))

${…} treats a string as an expression to execute;

A3’s result is a string: #2==“Yes”||#3==“Yes”||#5==“Yes”||#6==“Yes”||#8==“Yes”||#9==“Yes”

The esProc runtime environment is similar to that of R. For more information about processing text data, see Text Files. About macro in esProc, see Creative Uses of Stirngs.