Group a File & Pull Random Values by Time Intervals
【Question 】
This is a bit complex, and I greatly appreciate any help! I am trying to randomly sample rows from a .csv file. Essentially, I want a resulting file of unique locations (Locations are specified by Easting and Northing columns of the data file, below). I want to randomly pull 1 location per 12 hour period per SessionDate in this file (12 hour periods divided into: between 0631 and 1829 hours and between 1830 and 0630 hours; Given as Start: and End: in Data File, below); BUT if any 2 locations are within 6 hours of each other (based on their Start: time), for that location to be tossed, and a new location to be randomly drawn, and for this sampling to continue until no new locations are drawn (i.e., sampling WITHOUT replacement). I have been trying to do this with Python, but my experience is very limited. I tried first putting each row into a dictionary, and recently each row into a list, as follows:
import random
import csv
f=open('file.csv',"U")
list=[]
for line in f:
list.append(line.split(','))
I'm unsure where to go from here - how to sample from these lists the way I need to, and then write them to an output file with my'unique' locations.
Here is the top few lines of my data file:
SessionDate Start: End: Easting Northing
27-04-07 18:00 21:45 174739 9785206
28-04-07 18:00 21:30 171984 9784738
28-04-07 18:00 21:30 171984 9784738
28-04-07 18:00 21:30 171984 9784738
28-04-07 18:00 21:30 171984 9784738
It gets a bit complicated as some of the observations span midnight, so they may be on different dates, but can be within 6 hours of each other (which is why I have this criterion), for example:
SessionDate Start: End: Easting Northing
27-04-07 22:30 23:25 171984 9784738
28-04-07 0:25 1:30 174739 9785206
【Answer】
It’s convenient to achieve the whole algorithm in SPL (Structured Process Language):
A |
B |
C |
D |
|
1 |
=file("D:\\source.txt").import@t(SessionDate: date:"dd-MM-yy",Start:time:"HH:mm",End:time:"HH:mm",Easting,Northing).derive(datetime(SessionDate,Start):DateStart) |
|||
2 |
for A1.group(Easting,Northing) |
=null |
=A2.sort(DateStart).group@o(between(Start,time("6:30","HH:mm"):time("18:30","HH:mm"))) |
|
3 |
for C2 |
=if(B2,B3.select(interval@s(B3.DateStart,DateStart)>60*60*6),B3) |
||
4 |
if C3!=[] |
=@|(B2=C3(rand(C3.len())+1)) |
||
5 |
=file("D:\\result.txt").export@t(D4) |
A1: Read in the data file where the first line is used as column names, set the data format for SessionDate, Start and End, and then join SessionDate and Start up into a complete DateStart value.
A2: Group A1’s table by Easting and Northing and loop each group to process data. The working scope of A2’s loop is B2-D4 (A2 references a loop variable); similarly the working scope of B3’s loop is C3-D4.
B2: Stores records selected by each previous loop, where the initial value is null.
C2: Sort A2 by DateStart and group the table every 12 hours for each Start. @o option enables a grouping without sorting.
B3: Loop each group in C2.
C3: Return the first group of records where B2==null; otherwise get the records where Start is 6 hours later from B2’s Start.
C4-D4: If C3 gets certain records, pull a random one from them to store in B2 and then append it to D4. The sign @ represents the current cell D4.
D4’s result:
A5: Write D4’s records to the target file result.txt.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL