JOINing Two Excel Worksheets
【Question】
I’m working on combining two excel worksheets. Before I start, I’d like to mention that I also have mysql workbench, so I’m open to working on this issue in either sql or vba (I should learn both). I’m working with .bed files, which are lists of genomic coordinates. In short, the data is indexed by chromosome number (ie:chr2) and then has a numerical start and stop location on the chromosome. These numerical locations can span a large range (ie:100-10,000) or be a single position (ie: 999-1000). I have a list of coordinates that cover a large range, and in a separate file I have a list of single positions.
Example of a file with ranges:
chromosome start stop
chr1 45616321
chr3 984211253
Example of file with single positions:
chromosome start stop
chr1 52135214
chr3 1025410255
I would like to combine these worksheets such that if a location in my list of single positions is found within the range in my list of ranges, the locations for both are listed in the same row. The lists are 1000s of locations long, so I’d also like this program to loop through every row. Using the example data listed above, I’d like my output to look like the following:
Example of desired output:
chromosome start stop chromosome start stop
chr1 45616321 chr1 52135214
chr3 984211253 chr3 1025410255
There is a high probability that multiple single positions will fall within a single range, and I would like these to be listed as separate rows.
Below is a SQL solution. It works but it is complicated, because data needs to be retrieved from and then exported to the database.
select a.chromosome, a.start, a stop, b.chromosome, b.start, b.stop from ranges_table a, positions_table b where b.start >= a.start and b.stop <= a.stop
The following VBA solution also consists of the SQL query and the retrieval and export code.
SubSqlJoin()
Dim oConnAsNew ADODB.Connection
Dim oRSAsNew ADODB.Recordset
Dim sPath
Dim sSQLAsString, wbAsWorkbook
Set wb =ThisWorkbook
sSQL ="select a.chromosome, a.start, a stop,"& _
"b.chromosome, b.start, b.stop"& _
"from <ranges_table> a, <positions_table> b"& _
"where b.start >= a.start and b.stop <= a.stop"
sSQL =Replace(sSQL,“<ranges_table>”, _
Rangename(wb.Worksheets(“Ranges”).Range(“A1”).CurrentRegion))
sSQL =Replace(sSQL,“<positions_table>”, _
Rangename(wb.Worksheets(“Positions”).Range(“A1”).CurrentRegion))
Ifwb.Path<>""Then
sPath = wb.FullName
Else
MsgBox"The workbook must be saved first!"
ExitSub
EndIf
oConn.Open"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=‘"& sPath &"’;"& _
“Extended Properties=‘Excel 12.0;HDR=Yes;IMEX=1’;”
oRS.OpensSQL, oConn
IfNotoRS.EOFThen
wb.Worksheets(“Results”).Range(“A2”).CopyFromRecordsetoRS
Else
MsgBox"No records found"
EndIf
oRS.Close
oConn.Close
EndSub
FunctionRangename(rAsRange)AsString
Rangename=“[”& r.Parent.Name &“$”& _
r.Address(False,False) &“]”
EndFunction
【Answer】
It’s rather complicated to do this in VBA. Here’s a solution in SPL:
A |
|
1 |
=file(“D:\\range.xlsx”).xlsimport@t() |
2 |
=file(“D:\\position.xlsx”).xlsimport@t() |
3 |
=xjoin(A1;A2,start>A1.start && stop<A1.stop) |
4 |
=file(“D:\\result.xlsx”).xlsexport(A3,#1.chromosome,#1.start,#1.stop,#1.chromosome,#2.start,#2.stop) |
Result of executing the SPL script:
A1: Import one of the Excel worksheet as a table sequence;
A2: Import another Excel worksheet as table sequence;
A3: Perform a cross join over the two table sequences;
A4: Export the resulting table sequence to a new Excel worksheet.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_Desktop/
SPL Learning Material 👉 https://c.scudata.com
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProcDesktop
Linkedin Group 👉 https://www.linkedin.com/groups/14419406/