Dear R users,
the May 2014 public training course schedule for Milano (Italy) based courses is as follows:
|Web Applications with R and Shiny||May 15, 2014|
|Reports in R with RStudio||May 16, 2014|
|Basic R Programming||May 22, 2014|
|Data Visualization with R||May 23, 2014|
In a previous post on my personal blog about creating Pivot Tables in R with melt and cast we covered a simple way to generate sales reports and summary tables from a data set consisting of orders. It is often said that a picture is worth 1000 words, so in this series of posts we will focus on how to create visual representations and summaries of the same data.
Our graphical library of choice for the job will be ggplot2 (what else?), even though we are mostly going to use it in its simplest form, which is through qplot. I have written other posts on ggplot2 which you may also want to read.
1. Getting started
If you haven't done it yet, please complete steps 1, 2 and 3 in my previous post Pivot Tables in R with melt and cast. The file with the data can be obtained from the link at the bottom of that post. Once completed, you should have your data set loaded in R and ready for the next steps.
2. Checking the data
Before starting to plot any data frame with ggplot2, it is a good idea to check the data structure and make sure all variables have the correct type. As a matter of fact, ggplot2 is a very smart library and will attempt to plot your data even if they are not in the expected format. While this may or may not produce a warning message, the results may end up being far from what we expect. Better to check in advance and save ourselves the pain of lengthy troubleshooting afterwards.
It has been pointed out that str is one of the most useful functions in R and this is surely true! Let's take a look at the structure of our data set.
> str(data)
'data.frame': 799 obs. of 5 variables:
 $ Country     : Factor w/ 2 levels "UK","USA": 1 2 2 1 2 1 2 2 2 1 ...
 $ Salesperson : Factor w/ 9 levels "Buchanan","Callahan",..: 9 8 8 4 7 1 7 7 8 1 ...
 $ Order.Date  : Factor w/ 384 levels "01/01/2004","01/01/2005",..: 118 131 147 183 183 197 197 209 271 281 ...
 $ OrderID     : int 10249 10252 10250 10255 10251 10248 10253 10256 10257 10254 ...
 $ Order.Amount: num 1863 3598 1553 2490 654 ...
The use of str does indeed highlight a problem with our data set: Order.Date is currently regarded by R as a factor instead of a Date. If we are thinking of grouping our sales data by quarter, for example, it would be useful to convert it to the Date class so we can use date manipulation functions such as quarters() in base R (or quarter() from the lubridate package) to extract the quarter of the year. This is an easy fix.
data$Order.Date <- as.Date(data$Order.Date, "%d/%m/%Y")
Note that the format string used in as.Date has to match the format of the dates in Order.Date. In this case %d represents the day of the month in digits (01-31), %m the month in digits (01-12) and %Y (capital Y) the year in four-digit format (for example 2003).
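To see why the format string matters, here is a minimal sketch using two made-up date strings (they are not taken from the data set). It contrasts the matching format with a mismatched one.

```r
# Hypothetical sample values in day/month/year order, as in Order.Date
raw <- c("10/07/2003", "25/12/2004")

# Matching format string: day, month, four-digit year
ok <- as.Date(raw, format = "%d/%m/%Y")
print(ok)

# Mismatched format (month first): the first value is silently read as
# 7 October instead of 10 July, the second cannot be parsed and becomes NA
bad <- as.Date(raw, format = "%m/%d/%Y")
print(bad)
```

A quick look for unexpected NA values (for example with anyNA()) after the conversion catches the second kind of problem, but not the first, so it is worth eyeballing a few converted dates as well.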
After the conversion, our data set structure looks like this.
> str(data)
'data.frame': 799 obs. of 5 variables:
 $ Country     : Factor w/ 2 levels "UK","USA": 1 2 2 1 2 1 2 2 2 1 ...
 $ Salesperson : Factor w/ 9 levels "Buchanan","Callahan",..: 9 8 8 4 7 1 7 7 8 1 ...
 $ Order.Date  : Date, format: "2003-07-10" "2003-07-11" ...
 $ OrderID     : int 10249 10252 10250 10255 10251 10248 10253 10256 10257 10254 ...
 $ Order.Amount: num 1863 3598 1553 2490 654 ...
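With Order.Date stored as a Date, grouping by quarter becomes straightforward. The following is only a sketch on a small made-up data frame (the orders object and its values are assumptions that mirror the column names of our data set), using the base R function quarters().

```r
# Hypothetical orders; the column names mirror the data set in this post
orders <- data.frame(
  Order.Date   = as.Date(c("2003-07-10", "2003-08-15", "2003-11-02", "2004-01-20")),
  Order.Amount = c(1863, 3598, 1553, 2490)
)

# quarters() returns "Q1" ... "Q4"; prefix the year to keep years apart
orders$Quarter <- paste(format(orders$Order.Date, "%Y"), quarters(orders$Order.Date))

# Total order amount per quarter
by_quarter <- aggregate(Order.Amount ~ Quarter, data = orders, FUN = sum)
print(by_quarter)
```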
We are now ready to create our sales dashboard.
3. A simple scatter plot of orders
Visualizing data in a simple and immediate format should always be the first step of a good visual data analysis. This allows us to spot anomalies (for example outliers) and to get an overview of the content of the data set before aggregating and manipulating it further.
Let's start with a plot of all Order.Amount in a temporal sequence, which means by Order.Date.
library(ggplot2)
qplot(x=Order.Date, y=Order.Amount, data=data)
Note a few things here. First, we need to load the ggplot2 library before we can use qplot. This only needs to be done once in the same R session. Second, qplot is invoked with 3 arguments:
- x is the variable we want to plot on the horizontal axis
- y is the variable we want to plot on the vertical axis
- data is the name of the data set the variables belong to, which allows us to specify them just by variable name (such as Order.Date or Order.Amount) instead of in the full format (which would be data$Order.Date or data$Order.Amount)
Third, if we do not specify any further parameter, qplot uses its defaults for all the rest. Which default is used also depends on whether only y is specified or both x and y. When both x and y are specified, the default is to produce a scatter plot of y values versus x values. Another default is to use the variable names as labels for the axes, as well as to apply the standard theme. Enough technicalities, let's get back to data visualization.
Let's say we are interested in showing which country the orders came from. Let's color code the points in the scatter plot according to the value of the Country variable in the data set, which is either USA or UK. With qplot this is as easy as adding an extra argument to the function call.
qplot(x=Order.Date, y=Order.Amount, color=Country, data=data)
Note that the color parameter can also be used with its British spelling of colour. Here is the resulting chart.
Once more, qplot has applied some defaults. First, a standard high-contrast color scheme to distinguish between the orders coming from the two different countries. Second, a legend on the right of the chart specifying how to read each color. The title of the legend is, by default, the name of the variable used to color code the points. Sweet!
Let's try to color code the points according to the salesperson who took the order. Another easy one with qplot. Just change the color parameter to use the Salesperson variable.
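For readers on a recent ggplot2 release (qplot has been deprecated since version 3.4.0), the same color-coded scatter plot can be written with the full ggplot() syntax. This is a sketch on a small made-up data frame; the orders object and its values are assumptions standing in for the real data set.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this post
orders <- data.frame(
  Order.Date   = as.Date(c("2003-07-10", "2003-07-11", "2003-07-12", "2003-07-15")),
  Order.Amount = c(1863, 3598, 1553, 2490),
  Country      = factor(c("UK", "USA", "USA", "UK"))
)

# Same mapping as qplot(x=Order.Date, y=Order.Amount, color=Country, data=...):
# aes() declares the variables, geom_point() asks for a scatter plot
p <- ggplot(orders, aes(x = Order.Date, y = Order.Amount, colour = Country)) +
  geom_point()
print(p)
```

The defaults described above (axis labels, legend title, color scheme) are the same in both forms; ggplot() simply makes the choice of a scatter plot explicit.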
qplot(x=Order.Date, y=Order.Amount, color=Salesperson, data=data)
qplot has done a nice job of accommodating our request and color coding the points by Salesperson. However, there are too many colors and the chart is not really meaningful. Time to switch to a different view!
In Part 2 we will cover Bar Charts and how to make the best use of them. Till next time!
* This article originally appeared in Sales Dashboard in R with qplot and ggplot2 - Part 1
R is a powerful system for statistical analysis and data visualization. However, it is not exactly user-friendly for data storage, so for some time yet your data will probably be archived using Excel, SPSS or similar programs.
How do you open in R a file stored in the SPSS (.sav) format? Some packages, such as foreign, make this possible. The foreign package ships with the standard R distribution, so you just need to load it using the function library():
library(foreign)
Once the package is loaded, you can open your file if you know where it is located. The simplest way to locate a file (yes, I know, you could set the working directory, but I have abrupt manners) is to run the instruction:
file.choose()
R will open a file selection window, where you can look for your file in the folder where you saved it earlier. R then returns the path to the file.
Now you can read the SPSS file using foreign, specifying the path to the file (yes, you understood correctly, you need to copy and paste the path):
dataset = read.spss("C:\\PathToFile\\MyDataFile.sav", to.data.frame=TRUE)
Want to avoid the copy and paste? You can assign the result of the instruction file.choose() to an object named db (short for database):
db = file.choose()
As before, you obtain the path to the file, but this time R does not print it because you assigned it to the object db. The object db now contains a character string identifying the path that R has to follow to retrieve the file. With this approach you need to run file.choose() in every session, whereas if you write the path out you can reuse it every time. Ready to go?
dataset = read.spss(db, to.data.frame=TRUE)
read.spss() reads the dataset in .sav format. Be careful, however, to set the argument to.data.frame to TRUE, which asks the function to arrange the data in a data frame (i.e. the R object class for data tables).
Another very simple way to open an SPSS file in R is to save it in a format that R handles very well: the .dat format (tab-delimited). So you save your SPSS file as .dat and proceed as before, locating the file with file.choose() and assigning the resulting string to an object.
The function to read the file is now read.table(). Pay attention to missing data: if there are missing values, you should tell R which code is used for them (e.g. 999) by specifying a value for the argument na.strings.
Do you have your file in .dat format?
db = file.choose()
dataset = read.table(db, header=TRUE)
header = TRUE specifies that the first row of the file contains the variable names, so these values are not to be interpreted as data.
In a hurry? Combine all the operations in just one line:
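As a minimal sketch of how the missing-value code is handled, the example below writes a tiny tab-delimited file to a temporary location and reads it back. The file content and the code 999 are made up for illustration, and a temporary file replaces file.choose() so the example runs without user interaction.

```r
# Write a small hypothetical tab-delimited file to a temporary location
path <- tempfile(fileext = ".dat")
writeLines(c("id\tscore", "1\t12", "2\t999", "3\t7"), path)

# header = TRUE: the first row holds the variable names
# na.strings = "999": the code 999 is read as a missing value (NA)
dataset <- read.table(path, header = TRUE, sep = "\t", na.strings = "999")
print(dataset)
```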
dataset = read.spss(file.choose(), to.data.frame=TRUE)
or, with .dat:
dataset = read.table(file.choose(), header = TRUE)
Once you import a file, it’s a good idea to verify that the reading was performed with accuracy.
To check the size of your database, use the dim() function. You will obtain two numbers: the first is the number of cases (the rows in your database), the second is the number of variables (the columns of your database).
It can also be useful to preview the data. To inspect the first six rows of the dataset, use the head() function.
To inspect the last six rows of the dataset, use the tail() function.
To inspect the structure of the dataset, use the str() function.
Do you want to view the entire data table? If the data table is large, it is advisable to use the function fix(), which also allows you to manually edit the cell contents.
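The checks described above can be sketched as follows. The built-in mtcars data set is used here only as a stand-in for an imported file.

```r
# mtcars ships with R and stands in here for an imported data set
dataset <- mtcars

print(dim(dataset))   # number of rows (cases), then number of columns (variables)
print(head(dataset))  # first six rows
print(tail(dataset))  # last six rows
str(dataset)          # type and first values of every variable
```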
This post was originally written in Italian by Davide Massidda and Antonello Preti and published on the InsulaR blog.
How do you open a Microsoft Excel file in R? Please refer to the post Read Excel files from R.
My previous article showed an example in which data analysis requires a structured framework built with R and OOP. This article describes in more detail how to build that framework.
Using OOP means creating new data structures and defining their methods, which are functions that perform a specific task on the object. Defining a new data structure requires creating a new class, and this article shows how to create one through R's S4 classes.
The R function that defines a new class is named setClass().
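As a minimal sketch of a setClass() call, the example below defines a class with a single slot and creates an instance with new(). The class name simpleClass and the slot value are hypothetical names chosen for illustration.

```r
library(methods)

# Hypothetical class with a single numeric slot
setClass(Class = "simpleClass", representation(value = "numeric"))

# new() creates an instance of the class
obj <- new("simpleClass", value = c(1, 2, 3))
print(is(obj, "simpleClass"))
print(obj)
```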
The definition of a new class requires defining a basic structure, which can be identical to that of another class. The new class then inherits the structure and the methods of the other class, which in my example is data.table. The parent class is specified through the contains argument of setClass().
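Inheritance from an existing table class might look like the sketch below. To keep it runnable with base R only, data.frame is used as the parent instead of data.table; the class name salesTable and the column names are hypothetical.

```r
library(methods)

# data.frame is used as the parent so the sketch runs without extra packages
setClass(Class = "salesTable", contains = "data.frame")

# The parent object is passed to new() as an unnamed argument
tab <- new("salesTable", data.frame(col1 = 1:3, col2 = c("a", "b", "c")))

# The new class keeps the structure of its parent
print(is(tab, "data.frame"))
print(names(tab))
```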
Another option is to create a class that contains several basic data structures. This kind of class is similar to a list whose elements belong to defined classes. An example is a class with two slots, called item and info, of class data.table and list respectively:
setClass(Class='itemClass', representation(item='data.table', info='list'))
This option allows defining more complex data structures containing all the data relevant to a problem. In this example, item contains a table and info contains some metadata. After an instance of the new class is created, its slots can be accessed through the @ operator (instead of $, as with lists).
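Slot access can be sketched as follows. The class name itemSketch and the slot contents are hypothetical, and data.frame stands in for data.table to avoid an extra dependency.

```r
library(methods)

# Hypothetical two-slot class: a table plus free-form metadata
setClass(Class = "itemSketch", representation(item = "data.frame", info = "list"))

obj <- new("itemSketch",
           item = data.frame(col1 = 1:3),
           info = list(source = "example", year = 2014))

# Slots are read with @ ...
print(obj@item)
# ... while the elements of a list stored in a slot are still read with $
print(obj@info$source)
# slotNames() lists the available slots
print(slotNames(obj))
```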
A useful feature of S4 classes is the option of testing new objects when they are created. R allows you to define a function that performs some specified tests. In this example the function checks whether a column is present. The syntax is
functionTest = function(object) return(ifelse('col1' %in% names(object), TRUE, 'missing column'))
setClass(Class='newClass', contains='data.table', validity = functionTest)
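To see the validity test in action, the sketch below creates one valid and one invalid object. It makes two assumptions for the sake of a self-contained example: data.frame replaces data.table as the parent class, and the names checkCol1 and checkedTable are hypothetical.

```r
library(methods)

# Validity function: TRUE when valid, otherwise a message describing the problem
checkCol1 <- function(object) {
  if ("col1" %in% names(object)) TRUE else "missing column"
}

# data.frame stands in for data.table so that only base R is needed
setClass(Class = "checkedTable", contains = "data.frame", validity = checkCol1)

# A table with col1 passes the test
good <- new("checkedTable", data.frame(col1 = 1:3))

# A table without col1 is rejected at creation time; the error is caught here
bad <- tryCatch(
  new("checkedTable", data.frame(other = 1:3)),
  error = function(e) conditionMessage(e)
)
print(bad)
```

The same test can be run later on an existing object with validObject(), which is useful after modifying it.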
Classes that inherit from data.table make it possible to define specific kinds of tables. Classes with slots make it possible to put together different tables and add further information. My next article will describe how to define methods that generate the objects and perform specific operations.
The seventh Torino R net meeting, on 27 Mar 2014, exceptionally hosted at Polo Universitario di Asti, will feature three presentations:
- Processing and analysis methods for DNA methylation array data, Giovanni Fiorito, Complex Systems for Life Sciences, University of Turin;
- Temporal Dominance of Sensations (TDS) data analysis using R, Silvia Salini, Università degli Studi di Milano and Luisa Torri, Università degli Studi di Scienze Gastronomiche, Pollenzo;
- Biodiversity analysis in R, Stefano Schiaparelli, Università degli Studi di Genova, Dipartimento di Scienze della Terra, dell’Ambiente e della Vita (DISTAV).
- 10:00 – 12:30, free R introductory course;
- 14:30 – 16:00, free spatial data analysis workshop;
- 16:00 – 18:00, Seventh Torino R net meeting.