Archive

Archive for November 14, 2015

Using R to load CSV and Excel file in Sql Server 2016

November 14, 2015 Leave a comment

It’s been more than two and half years I blogged last time . Lot of changes personally and professionally 🙂 . Finally had some time to learn and test some of the new features coming up in Sql Server 2016.

As soon as I heard about “Integration of R” as new features in Sql Server 2016 I couldn’t resist trying it out for some of the frequent and important task I do when I have to analyze data . Loading .csv or .xlsx file data in Sql server. Loading .csv file and .xlsx in Sql Server was always possible through tsql (bulk insert)  , SSIS or Import/Export but there were always some hoops to go through when loading .xlsx type of data. With introduction of R in Sql Server I think we have one more easy way to load these type of files. Lets see how it works.

“Assuming you have already installed Sql Server 2016 CTP 3 and run all post installation configurations for R.”

I would load baseball.csv file in sql server first. With introduction of R in Sql Server 2016 we can execute R scripts directly from SSMS using sp sp_execute_external_script like this:

execute sp_execute_external_script
@language = N’R’
, @script = N’ setwd(”D:/csvdirectory”);
mytextvariable <- read.csv(“baseball.csv”,header = TRUE);
OutputDataSet <- as.data.frame(mytextvariable);’
, @input_data_1 = N’ ‘
WITH RESULT SETS (([Team] Char(5) not null, [League] char(3) NOT NULL, [Year] int not NULL , Win int not null, Playoffs bit not null ));

I am using 3 parameters here:

@Language to specify the script language “R”.

@script is the parameter where all R code goes .. This is where I included my basic R  script. In this script I first change working directory to directory where baseball.csv file exists. read.csv() is R function to to read csv file . Finally I convert csv file data to data frame and pass it to OutputDataSet variable. Since we generate values using just the R script so we leave the input query string in @input_data_1 blank as this is required parameter. We can also type any valid sql statement. This will not be executed by R anyways.

Result sets is where I create column names . I simply created the same column name as in .csv file.  When this script executes the values defined by the WITH RESULT SETS statement are returned in the Results pane in SSMS . Here is the screen shot.

rscript_output

 

Now what if I want to load this data in table for further analysis or to join with other tables. I couldn’t find any published way to do this so thought of trying it out myself. First thing came to mind was Insert…Exec method for loading stored procedure results in table. And after couple of dynamic sql and single quotes related issue it worked finally 🙂

Here is how it works:

I created temp table to store data. This could be permanent table too.

create table #t ([Team] Char(5) not null, [League] char(3) NOT NULL, [Year] int not NULL , Win int not null, Playoffs bit not null );

After that I insert into temp table and execute R script as dynamic sql. it’s same script as above but executed as dynamic sql like this:

Insert into #t
Execute
(‘
execute sp_execute_external_script
@language = N”R”
, @script = N” setwd(“D:/csvdirectory”);
mytextvariable <- read.csv(“baseball.csv”,header = TRUE);
OutputDataSet <- as.data.frame(mytextvariable);”
, @input_data_1 = N” ;”
WITH RESULT SETS (([Team] Char(5) not null, [League] char(3) NOT NULL, [Year] int not NULL , Win int not null, Playoffs bit not null ));’
);

select * from #t;

Voila. Data is loaded in table. Now we can do all kind of analysis or use it whatever way we need to.

I am going to use it a lot to automate .csv file or .xlsx file upload to table . I didn’t check with huge csv or xlsx files but I think some performance penalty is ok for my scenarios looking at other benefits we have with this approach.

Loading .xlsx file requires to load xlsx library in the R script else it won’t work. I had to install xlsx package first into my R environment then it is same script as above with extra line of code to load xlsx library (highlighted in red).

Here is script to load excel files :

Insert into #t
Execute
(‘
execute sp_execute_external_script
@language = N”R”
, @script = N” setwd(“R:/DBA”);
library(xlsx);
mytextvariable <- read.xlsx(“baseball.xlsx”, sheetName=”Sheet1″);
OutputDataSet <- as.data.frame(mytextvariable);”
, @input_data_1 = N” ;”
WITH RESULT SETS (([Team] Char(5) not null, [League] char(3) NOT NULL, [Year] int not NULL , Win int not null, Playoffs bit not null ));’
);

That’s it . I am really excited to see lot of cool features in Sql Server 2016 .  Some of these features are definitely going to help me a lot in my day to day work . Couple of features I am looking forward to learn and work are R in Sql Server and load\query JSON format data .

Will continue to test and learn some more R related Sql Server features and blog here.

Thanks,

NJ

 

 

Categories: Sql Server Tags: