Data Analyst and Technical Consultant

Data step v PROC SQL

Both the Data Step and Proc SQL can slice and dice your data, but which one do you choose?

The short answer is, it depends.

  • The Data Step works sequentially so will work through each observation one by one to do your merge, whereas PROC SQL doesn’t care what order your data is in so in some cases can be more efficient.
  • If your data set is small it may be worth using the Data Step as all results are returned in order but if you are working with something larger consider PROC SQL.  Either way, SQL is common and a good skill to have.

In this post you’ll learn how to subset, sort, group, eliminate duplicates, and combine data using PROC SQL alongside the SAS snippet.

 

Creating a subset v Creating a table

 

data employeedata2;
set employeedata;
where 20000 <= salary <= 40000;
run;

 

proc sql;
create table employeedata2 as
select * /* the asterix returns all columns from the table */
from employeedata
where salary BETWEEN 20000 and 40000;
quit; /* use quit instead of run */

 

Sorting a dataset v Sorting a table

 

proc sort data=partnerdata; by clientcapacity; run;

 

proc sql;
create table partnerdata2 as
select *
from partnerdata
order by clientcapacity; /* the default is ascending but can be changed to descending using DESC */
quit;

 

Deduping a dataset v Select Distinct

 

proc sort data=customerdata nodupkey out=customerdata2; /* optional new table as output */
by customerid;
run;

 

proc sql;
create table customerdata2 as
select distinct *
from customerdata
order by customerid;
quit;

 

Merging v INNER JOIN

 

proc sort data=students1; BY id; run;
proc sort data=students2; BY id; run;
data students3;
merge students1 ( in= a)
students2 ( in= b );
by id;
if a and b;
run;

 

proc sql;
create table students3 as
select students1.id, students2.name students2.age
from students1
inner join students2 /* inner join returns results where the id variable appears in both tables*/
on students1.id = students2.id;
quit;

 

Merging v LEFT JOIN

 

proc sort data=students1; BY id; run;
proc sort data=students2; BY id; run;
data students3;
merge students1 ( in= a)
students2;
by id;
if a ;
run;

 

proc sql;
create table students3 as
select students1.id, students2.name students2.age
from students1
left join students2 /* left join returns all results from the 'left' table and leaves a blank where the variable does NOT appear in both tables*/
on students1.id = students2.id;
quit;

 


Learn more about how the data is processed and more about which JOIN you should use