Introduction
SarcomaCellMinerCDB is an interactive web application that simplifies access to and exploration of sarcoma cancer cell line pharmacogenomic data across different sources (see the Metadata section for more details). Navigation in the application is done using the main menu tabs (see figure below). There are seven tabs: Univariate Analyses, Multivariate Analysis, Mutation variants, Metadata, Search, Help and Video tutorial. Univariate Analyses is selected by default when entering the site. Each option includes a side bar menu (to choose input) and a user interface output to display results. Analysis options are available at the top for both the Univariate Analyses and Multivariate Analysis tabs (see the sub-menu in the figure). The result of the first sub-menu option is displayed by default (Figure 1).

Figure 1: Main application interface
Univariate Analyses
Molecular and/or drug response patterns across sets of cell lines can be compared to look for possible associations. The univariate analysis panel includes 4 options: Plot Data, View Data, Compare Patterns and Tissue Correlation. Almost all options use the same input data in the left side panel.
- The x-axis data choices include 4 fields to be filled in by the user:
- x-Axis Cell Line Set selects the data source. The user can choose: NCI Sarcoma, Global Sarcoma, CCLE, GDSC, CTRP, Achilles or MD Anderson (see Data Sources for more details).
- x-Axis Data Type selects the data type to query. The options vary depending on the source selected above, and appear in the x-Axis Data Type dropdown. See the Metadata tab for descriptions and abbreviations.
- Identifier selects the identifier of interest for the data type selected above. For instance, if drug activity for the NCI Sarcoma is selected, the user can enter a single drug name or drug ID (NSC number). The Search IDs tab can be used to explore potential identifiers interactively or to download datasets of interest.
- x-Axis Range allows the user to control the x-axis range for better visualization.
- The y-axis data choices are as explained above for the x-axis.
- Selected tissues: by default, all tissues are selected and included in the scatter plot. To include or exclude cell lines from specific tissues, the user should specify:
- Select Tissues to include or exclude specific tissues
- Select Tissues of Origin Subset/s, at the bottom of the left-hand panel. The tissues of origin are organized as a tree and are all selected by default. To select a specific tissue, the user should click on the root of the tree (the triangle icon) and expand the tree recursively until reaching the desired subtree or leaf. The selection is finalized by clicking on the leaf label. On Macs, more than one tissue of origin may be selected using the “command” key; on PCs, use the “control” key. All cell lines were mapped to the four-level OncoTree cancer tissue type hierarchy developed at Memorial Sloan-Kettering Cancer Center. In the CellMinerCDB application, a tissue value is coded as an OncoTree node that can include elements from level 1 to level 4 separated by the “:” character.
- Tissues to Color, to locate cell lines from the desired tissues within the scatter plot. By default, the cell lines are colored by the pre-assigned color of their OncoTree level 1 tissue. The user now has the option to select up to 4 tissues to be shown in different colors (red, green, dark blue and orange); the remaining cell lines are then colored in light blue. The Show Color checkbox must be active.
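As a rough illustration of the tissue coding described above, the sketch below splits a “:”-separated tissue value into its OncoTree levels and checks whether a cell line falls under a selected tree node. The function names and the example tissue value are hypothetical and are not taken from the application.

```python
def split_oncotree(tissue_value: str) -> list[str]:
    """Split a ':'-separated tissue value into its OncoTree levels (1 to 4)."""
    levels = [part.strip() for part in tissue_value.split(":") if part.strip()]
    if not 1 <= len(levels) <= 4:
        raise ValueError(f"expected 1 to 4 OncoTree levels, got {len(levels)}")
    return levels


def matches_selection(tissue_value: str, selected_node: str) -> bool:
    """True if the cell line's tissue lies at or below the selected tree node."""
    cell = split_oncotree(tissue_value)
    node = split_oncotree(selected_node)
    return cell[:len(node)] == node


# Hypothetical tissue value, for illustration only.
example = "soft_tissue:sarcoma:ewing_sarcoma"
print(split_oncotree(example))
print(matches_selection(example, "soft_tissue:sarcoma"))
```

Selecting a subtree then amounts to keeping every cell line whose leading levels match the selected node.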
Plot Data
Any pair of features from different sources across common cell lines can be plotted (as a scatterplot) including the resultant Pearson correlation and p-value. The p-value estimates assume multivariate normal data, and are less reliable as the data deviate from this. Please use the scatter plot to check the data distribution (e.g., for outlying points outside of a more elliptically concentrated set).
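For readers who want to reproduce the reported statistics outside the application, here is a minimal pure-Python sketch of the Pearson correlation and its two-sided p-value under the usual t-test with n - 2 degrees of freedom. It is not the application's code (which runs in R); the function names are hypothetical and the input values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for H0: rho = 0, using t = r * sqrt((n-2) / (1-r^2))."""
    df = n - 2
    if df < 1:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1.0:
        return 0.0
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Substituting t = sqrt(df) * tan(theta) turns the t-density integral
    # over [0, t] into a bounded integral of cos(theta)^(df - 1).
    theta = math.atan(t / math.sqrt(df))
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                     - 0.5 * math.log(math.pi))
    h = theta / steps
    area = 1.0 + math.cos(theta) ** (df - 1)  # end points of Simpson's rule
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    central = const * area * h / 3  # P(0 < T < t)
    return min(1.0, max(0.0, 1.0 - 2.0 * central))


# Made-up expression and activity values for six cell lines.
x = [2.1, 3.4, 1.9, 5.2, 4.8, 3.0]
y = [0.3, 0.9, 0.1, 1.8, 1.2, 0.7]
r = pearson_r(x, y)
print(r, two_sided_p(r, len(x)))
```

As the text notes, this p-value rests on the normality assumption, so it is worth inspecting the scatter plot before relying on it.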
The icons at the top of the plot provide the following options, from left to right:
- Download the plot as a PNG.
- Zoom in on an area of interest by clicking and dragging with the pointer.
- Autoscale the image.
- Create horizontal and vertical lines from either a cell line dot or the regression line, by hovering over them.

Figure 2: An example scatterplot of SLFN11 gene expression (x-axis) versus Topotecan drug activity (y-axis), both from the NCI Sarcoma. The Pearson correlation value and p-value appear at the top of the plot. A linear fit line is included. This is an interactive plot: whenever the user changes any input value, the plot is updated. Hovering over any point in the plot shows additional information about the cell line, tissue, OncoTree designation, and x and y coordinate values.
View Data
This option both displays the data selected from the **Plot Data** tab in tabular form and provides a **Download selected x and y axis data as Tab-Delimited File** option. The user can change the input data in the left selection panel as described for Plot Data. The displayed table includes the cell line, the x-axis value, the y-axis value, the tissue of origin and the 4 OncoTree levels. In the header, the selected features are prefixed by the data type abbreviation and suffixed by the data source.

Figure 3: Shows the selected values for SLFN11 gene expression (x-axis) and Topotecan (id 609699) drug activity (y-axis) from the NCI Sarcoma across all common lines. The features are coded as expSLFN11_uniSarcoma and act609699_uniSarcoma where “exp” and “act” represent respectively prefixes for microarray gene expression and drug activity.
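The column naming convention described above (data type prefix, identifier, underscore, data source) can be sketched as follows. The helper names are hypothetical, and the prefix list contains only the abbreviations mentioned in this help text; the Metadata tab is the reference for the full list.

```python
# Only the abbreviations mentioned in this help text; see the Metadata tab
# for the complete list (this tuple is an assumption, not the full set).
KNOWN_PREFIXES = ("exp", "act", "mut")


def make_feature_name(prefix: str, identifier: str, source: str) -> str:
    """Build a column name such as 'expSLFN11_uniSarcoma'."""
    return f"{prefix}{identifier}_{source}"


def parse_feature_name(name: str) -> tuple[str, str, str]:
    """Split a column name back into (prefix, identifier, source)."""
    body, sep, source = name.rpartition("_")
    if not sep:
        raise ValueError(f"no data source suffix in {name!r}")
    for prefix in KNOWN_PREFIXES:
        if body.startswith(prefix):
            return prefix, body[len(prefix):], source
    raise ValueError(f"unknown data type prefix in {name!r}")


print(make_feature_name("exp", "SLFN11", "uniSarcoma"))
print(parse_feature_name("act609699_uniSarcoma"))
```

Splitting on the last underscore keeps identifiers that themselves contain an underscore intact.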
Compare Patterns
This option computes the correlation between the selected feature, as defined by the specified **x-Axis Cell Line Set**, **x-Axis Data Type** and **Identifier**, and either all drug or all molecular data from the (same) x-Axis or y-Axis source. By default all tissues are selected; however, the user can restrict the analysis to specific tissues of origin.
Pearson’s correlations are provided in tabular form, with reported p-values (not adjusted for multiple comparisons). The displayed features are ordered by level of correlation, and the table includes the target pathway for genes and the mechanism of action (MOA) for drugs (if available).
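Conceptually, the pattern comparison correlates one query feature against every other feature and ranks the results. The sketch below shows that idea in plain Python under simplifying assumptions: missing values are represented as `None`, features with fewer than three paired values or no variation are skipped, and p-values and annotations are omitted. All names and numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, or None when either vector has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def compare_patterns(query, features):
    """Correlate `query` with each feature; rows are sorted by r, highest first."""
    rows = []
    for name, values in features.items():
        # Keep only cell lines where both values are present.
        pairs = [(a, b) for a, b in zip(query, values)
                 if a is not None and b is not None]
        if len(pairs) < 3:
            continue
        xs, ys = zip(*pairs)
        r = pearson_r(xs, ys)
        if r is not None:
            rows.append((name, r, len(pairs)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up values across four cell lines.
query = [1.0, 2.0, 3.0, 4.0]
features = {
    "expA": [1.0, 2.0, 3.0, 4.0],
    "expB": [4.0, 3.0, 2.0, 1.0],
    "expC": [1.0, None, 2.0, 5.0],
}
for name, r, n in compare_patterns(query, features):
    print(name, round(r, 3), n)
```

Because many features are tested at once and the reported p-values are unadjusted, small p-values in such a table should be read with the number of comparisons in mind.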

Figure 4: Shows correlation results for SLFN11 gene with all other molecular features for all NCI Sarcoma datasets sorted by correlation value with gene location and target pathways (annotation field).
Tissue Correlation
This option displays, for each tissue of origin (OncoTree level 1), the number of cell lines with complete observations (non-missing values), the correlation between the selected pair of features, and its p-value.

Figure 5: Shows the correlation between the selected values for SLFN11 gene expression (x-axis) and Topotecan (id 609699) drug activity (y-axis) from the NCI Sarcoma across all common lines by tissue of origin. Note: The value “ALL” means all available common tissues between the 2 selected features.
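The per-tissue table can be thought of as a group-by over OncoTree level 1 plus an “ALL” row. The following sketch, with hypothetical names and made-up data, counts complete observations and computes the correlation per group; the p-value column is left out for brevity.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation, or None if it cannot be computed."""
    n = len(xs)
    if n < 3:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def tissue_correlation(records):
    """records: iterable of (tissue_level1, x, y); x or y is None when missing."""
    groups = defaultdict(list)
    for tissue, x, y in records:
        if x is None or y is None:
            continue  # only complete observations are counted
        groups[tissue].append((x, y))
        groups["ALL"].append((x, y))
    table = {}
    for tissue, pairs in groups.items():
        xs, ys = zip(*pairs)
        table[tissue] = {"n": len(pairs), "r": pearson_r(xs, ys)}
    return table


# Made-up records; the tissue labels are placeholders.
records = [
    ("bone", 1.0, 2.0), ("bone", 2.0, 4.0), ("bone", 3.0, 6.5),
    ("soft_tissue", 1.0, 3.0), ("soft_tissue", 2.0, None),
    ("soft_tissue", 3.0, 1.0), ("soft_tissue", 4.0, 0.0),
]
for tissue, stats in tissue_correlation(records).items():
    print(tissue, stats)
```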
Multivariate Analysis
The ‘Multivariate Analysis’ option (or module) has multiple tabs, including Heatmap, Data, Plot, Cross-Validation, Technical Details and Partial Correlation (described below), and allows construction and assessment of multivariate linear response prediction models within a single cell line set. For instance, we can assess prediction of drug activity based on the expression of selected genes. To construct a regression model, you first need to specify the input data in the left side panel.
- The response variable is chosen by selecting:
- Response Cell Line Set selects the data source for the response variable. The user can choose: NCI Sarcoma, CCLE, GDSC or CTRP (see the Data Sources section of Help for more details on these Cell Line Sets).
- Response Data Type selects the data type for the response variable (example: a drug or a molecular dataset). The options vary depending on the source selected above, and appear in the Response Data Type dropdown. See the Metadata tab for data type descriptions.
- Response Identifier selects the identifier for the response variable (e.g., a specific drug or gene identifier).
- The predictor variables are chosen by selecting:
- Predictor Cell Line Set selects the data source for the predictor variable. The user can choose: NCI Sarcoma, CCLE, GDSC or CTRP.
- Predictor Data Type/s selects the data type(s) for the predictor variables. Use the “command” key on Macs or the “control” key on PCs to select more than one dataset.
- Minimal Predictor Range provides a required minimum value for the identifier to be included for the first listed data type. The default is 0. One may increase this value to eliminate predictors that are considered to have insufficient range to be biologically meaningful.
- Predictor Identifiers selects the identifiers for the predictors. When using the Linear Regression algorithm, predictors must be entered. In Figure 6, we explore linear model prediction of Topotecan drug activity in the NCI Sarcoma using SLFN11 and BPTF gene expression. Identifiers from different sources may be combined using 2 methods. In the first, select multiple Data Types as desired and enter your identifiers; the model is then built automatically using those Data Types and Identifiers. For example, if expression and mutation are selected as Data Types and SLFN11 and BPTF are entered as identifiers, the model will be built using 4 identifiers: expSLFN11, expBPTF, mutSLFN11 and mutBPTF. In the second, more specific approach, enter each identifier with its data type prefix. For example, if your predictor variables are specifically the expression value for SLFN11 and the mutation value for BPTF, you can enter as identifiers: expSLFN11 and mutBPTF. When using the Lasso algorithm, predictors are optional (see Algorithm below), since Lasso automatically identifies the ones that best fit the model.
- Select Tissue/s of Origin is used to include or exclude specific tissues. By default, all tissue types are included; however, you can select one or more tissue types (to include or exclude). Use the radio buttons To include or To exclude to choose whether the selected tissues are included or excluded. To make multiple selections, use the “command” key on Macs or the “control” key on PCs.
- Algorithm: by default, the Linear Regression model is selected; however, you can also select the Lasso model (a penalized linear regression model), a machine learning approach. Linear regression is a linear approach to modeling the relationship between a response (or dependent variable) and one or more predictor variables (or independent variables). It is implemented using the R stats package lm() function. Lasso stands for least absolute shrinkage and selection operator. It is implemented using the cv.glmnet function (R package glmnet) and performs both variable selection and linear model coefficient fitting. The lasso lambda parameter controls the tradeoff between model fit and variable set size; lambda is set to the value giving the minimum error with 10-fold cross-validation. The initial random seed is set to 1 (set.seed) and alpha is set to 1. The minimum-error lambda is used to select the intercept and the coefficients for the variables (there is no range). For both algorithms, 10-fold cross-validation is applied to fit model coefficients and predict response while withholding portions of the data, to better estimate robustness. For further details on either of these algorithms, see the respective R packages. If the Lasso algorithm is selected, you have to specify:
- Select Gene Sets: gene selection is based on curated gene sets such as DNA Damage Repair (DDR) or Apoptosis. The user can select one or more gene sets.
- Maximum Number of Predictors allows choice of the number of predictors (default 4)
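The two ways of entering predictor identifiers described in the list above can be illustrated with a small sketch. It assumes that an identifier already starting with a known data type prefix is used as entered, and that any other identifier is combined with every selected data type; how the application itself resolves a mix of the two is not documented here, so treat this as an assumption. The function name is hypothetical.

```python
def expand_predictors(data_types, identifiers, known_prefixes=("exp", "mut")):
    """Return the prefixed predictor names implied by the user's input.

    Assumption: prefixes are matched case-sensitively, so a gene symbol in
    upper case is never mistaken for a prefixed identifier.
    """
    prefixes = tuple(known_prefixes)
    explicit = [i for i in identifiers if i.startswith(prefixes)]
    bare = [i for i in identifiers if not i.startswith(prefixes)]
    names = [f"{dt}{ident}" for dt in data_types for ident in bare] + explicit
    return list(dict.fromkeys(names))  # drop duplicates, keep order


# First method: two data types crossed with two gene symbols.
print(expand_predictors(["exp", "mut"], ["SLFN11", "BPTF"]))
# Second method: identifiers entered with their data type prefix.
print(expand_predictors(["exp", "mut"], ["expSLFN11", "mutBPTF"]))
```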
Once all the above information is entered, a regression model is built and the results are shown in different ways, such as the technical details of the model, observed vs. predicted response plots, or a heatmap of the variables. The different outputs of the regression model module are explained below.
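To make the idea of 10-fold cross-validation concrete, the sketch below predicts each cell line's response from a model fitted without that cell line's fold. It is a simplified stand-in, not the application's R code: it uses a single predictor with hand-written least squares, and the seed of 1 only echoes the text, since Python's random generator is unrelated to R's. All names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def cross_validated_predictions(xs, ys, k=10, seed=1):
    """Predict every response using a model that never saw its fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    predictions = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            predictions[i] = intercept + slope * xs[i]
    return predictions


# Made-up data: 20 cell lines with an exactly linear response.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
print(cross_validated_predictions(xs, ys)[:3])
```

Comparing these held-out predictions with the observed responses gives a more conservative view of model quality than the fit on all data.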
Heatmap
This option provides the observed response and predictor variables across all source cell lines as an interactive heatmap. For the heatmap visualization, data are range standardized (subtract the minimum, and divide by the range) to values between 0 and 1, based on the value range within all rows of a given data type (by default) or within each row of data (if ‘Use Row Color Scale’ is selected). For data types other than mutation data, the range is trimmed to the difference between the 95th and 5th percentiles; values below or above the 5th and 95th percentile values are scaled to 0 and 1, respectively. In the case of mutation data, the range used for scaling is the difference between the maximum and minimum values. If the values within a data type (or data row if ‘Use Row Color Scale’ is selected) are constant, the scaled value for heatmap visualization is set to 0.5.
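The scaling rule above can be sketched in a few lines. The percentile helper uses linear interpolation between ranks, which is an assumption about how the percentiles are computed, and the function names are hypothetical. Note also that this sketch returns 0.5 whenever the trimmed range is zero, which is slightly broader than the “constant values” case stated in the text.

```python
def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) by linear interpolation on a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def scale_for_heatmap(values, is_mutation=False):
    """Range-standardize values to [0, 1] for display.

    Pass all values of a data type (default color scale) or a single row
    ('Use Row Color Scale').
    """
    ordered = sorted(values)
    if is_mutation:
        low, high = ordered[0], ordered[-1]  # full min-to-max range
    else:
        low, high = percentile(ordered, 5), percentile(ordered, 95)
    if high == low:
        return [0.5] * len(values)
    # Values outside the 5th to 95th percentile range are clipped to 0 or 1.
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]


print(scale_for_heatmap([3, 3, 3]))
print(scale_for_heatmap([0, 1, 2], is_mutation=True))
```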
The user can restrict the number of cell lines to those that have the highest or lowest response values by selecting Number of High/Low Response Lines to Display. The user can download the heatmap related data by clicking on Download Heatmap Data.
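Restricting the display to the extreme responders amounts to ranking the cell lines by response and keeping both ends of the ranking. A minimal sketch follows, assuming the chosen number applies to each end separately (so 30 gives 60 cell lines); whether the application interprets the setting this way is an assumption, and the names are hypothetical.

```python
def select_extreme_lines(responses, n_each=30):
    """Return the n_each highest and n_each lowest responders, highest first.

    responses: dict mapping cell line name to its response value.
    """
    ranked = sorted(responses, key=responses.get, reverse=True)
    if len(ranked) <= 2 * n_each:
        return ranked  # fewer lines than requested: show them all
    return ranked[:n_each] + ranked[-n_each:]


# Made-up responses for ten cell lines.
responses = {f"line{i}": float(i) for i in range(10)}
print(select_extreme_lines(responses, n_each=2))
```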

Figure 6: An example heatmap where we selected topotecan as the response variable and SLFN11 and BPTF gene expression as predictor variables. In this example, we chose to display only the 60 cell lines with the 30 highest and 30 lowest values for topotecan activity.
If the Lasso algorithm is selected (see below), additional predictor variables are shown (PSN2, SMARCD1, DFFB and ARID1A).

Figure 7: Same example as previous figure with the lasso algorithm
Data
This option shows the detailed data for the model variables for each cell line. Both the 10-fold cross-validation (CV) and the predicted responses are given. The data are displayed as a table with filtering options for each column.
