Tutorial

ChemFlow-Tutorial

ChemFlow is built on Galaxy (https://wiki.galaxyproject.org), a data mining platform. It is a web application dedicated to chemometrics.

Accessibility

Access link

No installation is needed: ChemFlow is a web application. Just log on to https://vm-chemflow-francegrille.eu.

Browsers

We recommend the Google Chrome browser. The graphical display functions of ChemFlow are less ergonomic in Mozilla Firefox or Internet Explorer. ChemFlow has not been fully tested in other browsers, so we cannot guarantee that it works properly there.
If your browser displays an invalid security certificate warning, click on Advanced and accept the connection or continue to the site.

Help

All the documentation for the tool is in English. To translate it, use the Chrome browser: right-click in the menu bar of the ChemFlow tool (Galaxy) and choose Translate to French. In general, if you need help using ChemFlow, feel free to ask questions at chemflow@chemproject.org.

Connection

If this is your first connection

Click on "you may create one" and fill in the fields. Ignore the message about the activation link. About 1 minute later, log in with your credentials!
If you have a problem, write to chemflow@chemproject.org with the message subject "Chemflow - access request". Preferably use your professional address and include a short personalized message in the body.

If you have already created and validated an account

From the User menu, choose Login, then enter your public name or email and your password.
Notes:
- Make sure to use a valid email address. On rare occasions, we will need to contact certain users.
- Because of data quotas, and to encourage responsible use, only one account per user is allowed.
- Please respect the ChemFlow user charter.

Discovery of the environment

The interface is organized as follows:

  • the panel on the left is a menu with links to chemometric tools / functions, sorted by category
  • the panel on the right is a history containing the data imported or generated by the tools
  • the center displays the interface of the selected tool or data from the history

Different data formats can be imported (csv, mat, rdata), including via a URL.

Data

Data loading

  1. Open the Shared data menu, then Data libraries, and choose the Training library
  2. Select the files (fromage.csv, Xnir_mais.csv) to import into your history by checking the corresponding boxes.
  3. Click on To History, choose the “unnamed history” history, then click on Import.
  4. Return to the home page by clicking either on the Galaxy / ChemFlow logo at the top left or on the Data Analysis menu

Description of data

The file fromage.csv contains daily chemical analyses (pH, humidity, fats, proteins and calcium) carried out for 15 days on a cheese. The file Xnir_mais.csv contains 80 near infrared spectra (700 wavelengths, from 1100 to 2498 nm) of corn samples.
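
As a quick sanity check on these figures: if the wavelengths are evenly spaced (an assumption, since the description does not state the step), 700 points from 1100 to 2498 nm imply a constant step between neighbours. A minimal Python sketch of that arithmetic:

```python
# Assumption: the 700 wavelengths are evenly spaced between 1100 and 2498 nm.
first_nm, last_nm, n_points = 1100, 2498, 700

# Step between consecutive wavelengths on an evenly spaced grid.
step_nm = (last_nm - first_nm) / (n_points - 1)

# Rebuild the grid and check that it matches the stated range and size.
wavelengths = [first_nm + i * step_nm for i in range(n_points)]

print(step_nm)           # spacing in nm
print(len(wavelengths))  # number of wavelengths
print(wavelengths[-1])   # last wavelength in nm
```

Under that assumption the spacing works out to 2 nm, which is a handy value to keep in mind when reading the x axis of the spectra plots.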

Data actions

Each dataset, such as fromage.csv, appears in the history with the following elements:

  • a number indicating in which order (or at which step) this dataset was created
  • the name of the dataset
  • Click on the view button (eye icon) to display the content of the dataset in raw format in the browser
  • Click on the edit button (pencil icon) to modify the properties of the dataset
  • Click on the delete button (cross icon) to delete the dataset from the history.
  • Click on X deleted below the name of your history to make the deleted files reappear. Go to the deleted file and click Undelete it to cancel its deletion.
  • Click on the name of the dataset and the view widens to show more details:
    • The dataset information: the number of lines (including the header), then the format (possible formats: tabular, rdata, mat, pdf, html)
    • Click on the floppy disk icon to save the data on your PC and view it with a spreadsheet or text editor.
    • Click on the i symbol to see the description of the function that produced the data
    • The recycling sign (restart symbol) lets you re-execute the function with the same parameters
    • The histogram logo lets you create a graph from the output data. The graphics offered are not necessarily well suited to chemometrics. This is why we have developed graphic tools specific to our application, in the Plot section of the list of tools (left column)

Data display

- Click on the eye to display the 2 datasets.
- Activate the scratchbook mode by clicking on the grid icon at the top right. This mode is very useful when you are in the Firefox browser.
- Click on the eye to display the 2 datasets. How many samples and variables are there in the Xnir_mais.csv file?
Input files should have a structure like the ones above. The first line must contain the column names (variable names) and the first column must contain the row names (sample names).
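
For readers who want to check their answer outside ChemFlow, the sketch below counts samples and variables in a table laid out this way. It is only an illustration: the miniature file content, the ";" separator and the function name `count_samples_and_variables` are hypothetical.

```python
import csv
import io

# Hypothetical miniature file with the expected layout: the first row holds
# the variable names and the first column holds the sample names.
# The separator (";") and the values are illustrative only.
text = """;1100;1102;1104
s1;0.11;0.12;0.13
s2;0.21;0.22;0.23
"""

def count_samples_and_variables(handle, delimiter=";"):
    """Return (n_samples, n_variables) for a table with row and column names."""
    rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    n_variables = len(header) - 1  # the first header cell sits above the names
    n_samples = len(body)          # every remaining row is one sample
    return n_samples, n_variables

n_samples, n_variables = count_samples_and_variables(io.StringIO(text))
print(n_samples, n_variables)
```

The same function can be pointed at a downloaded file by passing `open(path, newline="")` instead of the in-memory text.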

Running a tool

- Run the Statistics / Mean tool with the fromage.csv file:

  • Dataset? fromage.csv
  • select all columns of the dataset? yes
  • compute the mean by a column factor? no

The History panel shows the status of your data. Several "states" exist:

  • gray color: awaiting execution
  • yellow color: running
  • green color: job finished successfully (ok state)
  • red color: job finished with an error (error state)
  • blue color: job interrupted

- View the output of this Mean tool run on fromage.csv

Graphics

From the fromage.csv dataset.

Histogram

Create a histogram of the pH variable.
Use tools / plot / histogram and fill in the following options:

  • Dataset? fromage.csv
  • Column for x axis? c3: pH
  • Plot Title? Histogram of pH
  • Bin Width? 1
  • use a column of a dataset as bar color? no

Redo the histogram of the pH variable using a bin width of 0.1. Use the "recycling" button under the file created previously and change the following option: Bin Width? 0.1
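
To see why the bin width matters, the sketch below bins a handful of values with widths 1 and 0.1. The pH values are hypothetical (they are not taken from fromage.csv) and `histogram_counts` is an illustrative helper, not the ChemFlow histogram tool.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    # round() absorbs floating-point noise such as 5.3 / 0.1 == 52.99999999999999
    counts = Counter(math.floor(round(v / bin_width, 9)) for v in values)
    return dict(sorted(counts.items()))

# Hypothetical pH readings spanning a narrow range.
ph = [5.12, 5.18, 5.25, 5.31, 5.34, 5.47]

coarse = histogram_counts(ph, 1)    # every value falls into the same bin
fine = histogram_counts(ph, 0.1)    # the distribution becomes visible
print(coarse)
print(fine)
```

With a width of 1 all the readings collapse into a single bar, while a width of 0.1 spreads them over several bins, which is the effect the second histogram is meant to show.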

Boxplot

To be tested with the same data set.

Scatter plot

Create a scatter plot matrix of the variables: pH, humidity, fat, protein and calcium
Use tools / plot / Scatter plot matrix and fill in the following options:

  • Dataset? fromage.csv
  • Column for x axis? c3: pH c4: humidity c5: fat c6: protein c7: calcium
  • Plot Title? Correlation plot

Create a scatter plot of the pH variable as a function of the day of analysis.
Use tools / plot / scatter plot and fill in the following options:

  • Plot Title ? pH=f(jour)
  • label for x axis ? jour
  • label for y axis ? pH
  • series/plot type ? Line and points
  • series/plot type/Dataset ? fromage.csv
  • series/plot type/column for x axis ? c2 : jour_prod
  • series/plot type/column for y axis ? c3 : pH

Create a scatter plot of the variable pH as a function of the calcium concentration
Use tools / plot / scatter plot and fill in the following options:

  • Plot Title ? pH=f(calcium)
  • label for x axis ? calcium
  • label for y axis ? pH
  • series/plot type ? Points
  • series/plot type/x-Dataset ? fromage.csv
  • series/plot type/column for x axis ? c7 : calcium
  • series/plot type/y-Dataset ? fromage.csv
  • series/plot type/column for y axis ? c3 : pH

Create a scatter plot of the humidity variable according to the sample label
Use tools / plot / scatter plot and fill in the following options:

  • Plot Title ? humidite=f(echantillon)
  • label for x axis ? echantillon
  • label for y axis ? humidite
  • series/plot type ? Lines and Points
  • series/plot type/Dataset ? fromage.csv
  • series/plot type/column for x axis ? c1 :
  • series/plot type/column for y axis ? c4 : humidite
  • series/plot type/ use first column as sample label ? yes

Spectra plot

Using the Xnir_mais.csv dataset, view the spectra as a graph: use tools / plot / spectra plot and enter the following options:

  • Plot Title ? Spectra Plot de Mais
  • label for x axis ? longueur d onde
  • label for y axis ? Absorbance
  • spectra/Dataset ? Xnir_mais.csv

Histories

All users have a current History, which can be thought of as a working space or directory. The History panel displays the output datasets in the order in which they were created from most recent (top) to oldest (bottom).
A history can also store information beyond the datasets it contains: it can be named or renamed, tagged, and annotated.

Rename and create history

1. Rename the history in the right-hand History panel. The current history is named "Unnamed history" by default:

  1. Click on the existing name. A text entry field appears with the current name.
  2. Enter a new name: "training"
  3. Press the Enter key on your keyboard to save the new name. The entry field disappears and the new name is displayed.
  4. To cancel this action, press Esc or click outside the input field.

2. Create a new history and rename it "data transfer":

  1. Click on the wheel icon at the top right of the home page.
  2. Click on Create New History

Sharing data between histories

Share data from the "training" history to the "data transfer" history.

  1. Click on the wheel icon at the top right of the home page.
  2. Click Copy Datasets
  3. Select the "training" history which contains the data you want to transfer.
  4. Select the Xnir_mais.csv data of this history that you want to transfer.
  5. Select the "data transfer" history in which you want to transfer this data.
  6. Click on copy History Items
  7. Refresh your history by clicking on the link shown at the top in green: N datasets copied to 1 history: data transfer

Organize your history

Solution 1
  1. Click on the wheel icon at the top right of the home page.
  2. Click on Saved Histories (Here you can permanently delete your history)
Solution 2

To manage your histories, click on the book-shaped icon at the top right of the home page. You should see all of the (not deleted) histories you have created.
To make a history the current one, click the Switch to button at the top of this history, then click Done at the top left to return to the data analysis page.
ChemFlow stores your history data in files. The total size is displayed below the history name. If you reach the quota (10 GB by default), new data can no longer be recorded and the tools can therefore no longer run. You then need to delete some datasets. You can also save your data on your computer using the floppy disk icon that appears when you click on a dataset in your history.

Share and Recover History Data (DEMO)

To share a history with one or more users:
  1. Click on the wheel icon at the top right of the home page.
  2. Click Share or Publish.

To share it with all users:

  • Click on Makes the history accessible via link

To share it with one or more user(s):

  • Click on Share with a user
  • Enter user email addresses

To share it with all ChemFlow users:

  • Click on Makes the history accessible and Publish: this will publish the history in the Published Histories section of Galaxy, where it is listed and searchable.

To retrieve a history shared with all users:

Open the Shared Data menu, then Published Histories, and choose the history you want to view. View the data you want to see using the eye icon: the data will be opened in a new tab. To import this history, click on the import history button, then choose a name for this history or leave the default title. Return to the home page by clicking either on the Galaxy / ChemFlow logo at the top left or on the Analysis Data menu.

Export / save your ChemFlow history to your PC

The history can be exported from the "cogwheel" icon, "export history to file"; then right-click on the link and choose "Open in a new tab" to save it on your PC. Then open the archive with decompression software such as 7zip, click on the ".dat" data, then on the "datasets" directory, and finally extract to your desired directory.

To retrieve a history that a user has shared with you:

Click on the wheel icon at the top right of the home page.

Click on Histories Shared with me.

Importing YOUR data

The input file format expected by ChemFlow is described below. The first row should contain the column names (variable names) and the first column should contain the row names (sample names):

  • text file readable by a spreadsheet or text editor, for example the csv format (separated by comma, semicolon, tabulation, ...)
  • row and column names are mandatory and unique
  • row names (sample names): no scientific notation, no special characters (for example °)
  • no repeated sample labels
  • no missing values
  • no tabulation in names
  • point as decimal separator
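
These rules can be checked on your own computer before uploading. The following is a minimal sketch, assuming a semicolon-separated file; `check_table` is a hypothetical helper (not part of ChemFlow), and treating any non-ASCII character as a "special character" is an assumption made for the example.

```python
import csv
import io
import re

# A cell such as "5,3" suggests a comma was used as the decimal separator.
COMMA_DECIMAL = re.compile(r"^-?\d+,\d+$")

def check_table(text, delimiter=";"):
    """Return a list of problems found in a table with row and column names."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    column_names = header[1:]
    row_names = [row[0] for row in body]
    problems = []

    for label, names in (("column", column_names), ("row", row_names)):
        if "" in names or len(set(names)) != len(names):
            problems.append(f"{label} names must be present and unique")
    if any("\t" in name for name in column_names + row_names):
        problems.append("names must not contain tabulations")
    # Assumption: "special characters" is approximated by "non-ASCII characters".
    if any(not name.isascii() for name in row_names):
        problems.append("row names must not contain special characters")

    for row in body:
        cells = row[1:]
        if len(cells) != len(column_names) or any(c.strip() == "" for c in cells):
            problems.append(f"missing value in row {row[0]!r}")
        if any(COMMA_DECIMAL.match(c.strip()) for c in cells):
            problems.append(f"comma used as decimal separator in row {row[0]!r}")
    return problems

good = "name;pH;humidity\nA1;5.3;41.2\nA2;5.1;40.8\n"
bad = "name;pH;humidity\nA1;5,3;41.2\nA1;;40.8\n"
print(check_table(good))
print(check_table(bad))
```

An empty list means none of the checked rules is violated; the check for scientific notation in sample names is left out to keep the sketch short.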

Use the Import Data / Upload File tool:

  • Choose local file, Type
    • The file extension is not important; what matters is knowing which separator was used.
    • If the file uses tab separators, choose the “tabular” type; for any other separator, choose the “csv” type.
  • Start and Close

To check that the import worked, view the format of your file and its content in scratchbook mode.

If you imported with type "csv", you must use the tool "convert format data / Convert delimiters to TAB" to convert the separators into tabs (tabular format), specifying in the "Convert all" parameter which separator your file uses.

If you imported with type “csv” but your file actually contains tab separators, you must also use the "convert format data / Convert delimiters to TAB" tool and select "whitespace".

Warning: if the cell in the 1st line and 1st column is empty, set the option "Strip leading and trailing whitespaces" to "no".

2 files are expected in ChemFlow:

  • 1 spectrum csv file
  • 1 csv file with properties
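
The delimiter conversion described above can also be done on your computer before uploading, so that the file can be imported directly as "tabular". The sketch below is an offline equivalent written with Python's standard csv module; it is not the ChemFlow tool itself, and `delimiters_to_tab` is a hypothetical name.

```python
import csv
import io

def delimiters_to_tab(text, delimiter):
    """Rewrite delimited text as tab-separated text, keeping every cell as-is."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

# Illustrative content: note the empty cell in the first line, first column.
semicolon_text = ";pH;humidity\nA1;5.3;41.2\n"
tab_text = delimiters_to_tab(semicolon_text, ";")
print(repr(tab_text))
```

The empty first cell is kept as a leading tab, which is the situation the warning about stripping leading whitespace is meant to protect.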

If you have repeats, the spectrum identifiers (= row names) must still be unique. In this case you can, for example, add the repetition number after the sample name (e.g. for sample A021a with 2 repetitions, name them A021a-1 and A021a-2). You can also average your spectra; for that you must have 2 files:

  • a file with the unique sample names and their spectra
  • and a factor file with the name of the group or the name of the sample (not necessarily unique)

Then pass them to the "Mean" function of the Statistics tools. As for the property samples, they must be duplicated, with the same values, so that the table has the same number of lines as the table of spectra.

To process the data of a project:

  • history management: create a history for each goal
  • workflows: create workflows from your history and share them.
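
The averaging of repeated spectra by a factor file, as described above, amounts to a group-wise mean. A minimal sketch with hypothetical spectra and factor values (three wavelengths per spectrum, two repetitions per sample); it illustrates the idea and is not the code of the ChemFlow "Mean" tool:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical spectra file: unique row names, two repetitions per sample.
spectra = {
    "A021a-1": [0.10, 0.20, 0.30],
    "A021a-2": [0.12, 0.22, 0.32],
    "B007c-1": [0.50, 0.60, 0.70],
    "B007c-2": [0.54, 0.64, 0.74],
}
# Hypothetical factor file: each spectrum mapped to its (non-unique) sample name.
factor = {
    "A021a-1": "A021a", "A021a-2": "A021a",
    "B007c-1": "B007c", "B007c-2": "B007c",
}

def mean_by_factor(spectra, factor):
    """Average the spectra that share the same factor value."""
    groups = defaultdict(list)
    for name, spectrum in spectra.items():
        groups[factor[name]].append(spectrum)
    # zip(*rows) walks the grouped spectra wavelength by wavelength.
    return {g: [fmean(col) for col in zip(*rows)] for g, rows in groups.items()}

means = mean_by_factor(spectra, factor)
for sample, spectrum in means.items():
    print(sample, [round(v, 3) for v in spectrum])
```

After averaging there is one spectrum per sample, so the properties table no longer needs duplicated rows to match it.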

Automate tasks, make code, workflow

There are several ways to create code, which in ChemFlow is called a workflow:

  • transform your history into a workflow: in ChemFlow, click on the cogwheel in the history frame at the top right and click on extract workflow.
  • create a workflow from scratch using the workflow editor: Workflow menu - Create New workflow button

Help :

PCA and PLS

PCA is an unsupervised exploratory statistical method, applied to a single table, used to:

  • explore and describe data
  • clean the data (outliers, extreme points)
  • identify groups of individuals
  • determine and quantify the links
  • serve as a preliminary step before another analysis.

On what type of data?

  • continuous or discrete quantitative data
  • on an array of variable dimension
  • noisy data
  • possible on a few NA (missing data in small quantities)
  • strong correlation between variables
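
To make the idea of "percentage of variance explained" concrete, here is a small standard-library sketch of PCA on a centered, unscaled table. It uses power iteration to find the first principal component, a technique chosen for brevity; it is not how ChemFlow computes PCA, and the data are hypothetical.

```python
from statistics import fmean

def center_columns(X):
    """Subtract each column mean (centering, no scaling)."""
    means = [fmean(col) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Covariance matrix of an already centered table (rows are samples)."""
    n = len(Xc)
    cols = list(zip(*Xc))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def first_component(C, iterations=200):
    """Power iteration: largest eigenvalue of C and its unit eigenvector."""
    v = [1.0] * len(C)
    for _ in range(iterations):
        w = [sum(cij * vj for cij, vj in zip(row, v)) for row in C]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Cv = [sum(cij * vj for cij, vj in zip(row, v)) for row in C]
    eigenvalue = sum(vi * cvi for vi, cvi in zip(v, Cv))
    return eigenvalue, v

# Hypothetical table: 4 samples, 2 strongly correlated variables.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
C = covariance(center_columns(X))
eigenvalue, loading = first_component(C)

# Total variance is the trace of the covariance matrix.
total_variance = sum(C[i][i] for i in range(len(C)))
explained_pct = 100 * eigenvalue / total_variance
print(round(explained_pct, 2))
```

Because the two variables are almost perfectly correlated, a single component carries nearly all the variance, which is the situation the "strong correlation between variables" point above refers to.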

PLS (Partial Least Squares, or Projection onto Latent Structures) regression is a supervised statistical method for predicting quantitative information: it predicts a state (for example a biological value) from spectra, using a model built on spectrum-measurement pairs. It models the variability of X and Y simultaneously and maximizes the covariance between their latent variables.

Description of the dataset

The data were produced by Nathalie Dupuy's team at the University of Aix-Marseille. Near infrared analyses were carried out on 187 olive oils. The data comprise:

  • a set of 187 spectra comprising 612 wavelengths, 1000 to 2222 nm: pir.csv;
  • a reference data set containing concentrations of different fatty acids and a coding of these 187 samples according to 6 geographical origins: 1 = Aix en Provence, 2 = Haute Provence, 3 = Nice, 4 = Nîmes, 5 = Nyons, 6 = Baux de Provence Valley: ags.csv.

Loading data

In ChemFlow, create the ACP-PLS history from the history frame at the top right. From the directory chemflow / shared data / data libraries / Formation, import the spectra, pir.csv, as well as the biochemical values, ags.csv. You can also find this data from the "ChemData" - "Data" menu.

Data visualization: exercise 1

  • View the number of lines in the files. How many samples are there?
  • View the data as a data table. How many variables are there? In ChemFlow click on the eye (activate scratchbook mode if necessary).
  • View the spectra in the form of a graph and download this graph. Use tools / plot / Spectra plot, and fill in: dataset -> pir.csv
  • Visualize the spectra by assigning a color to each origin. Use tools / plot / Spectra plot, and enter:
    • dataset -> pir.csv
    • use a column of a dataset as spectra color -> yes
    • dataset -> ags.csv
    • column for color -> c17 : origine

Carrying out a PCA: exercise 2

  • Apply an SNV pre-treatment on your spectra. Use tools / Pretreatments / SNV, and enter: Select X data -> pir.csv
  • Perform a PCA on the calibration data set with prior centering, without normalization.

Use tools / Calibration / Exploration / PCA, and fill in:

  • Select X data -> xCal(SNV(pir.csv))
  • Centering option -> yes
  • Scaling option -> no
  • Number of PC max -> 20
  • Visualize the numerical values of the percentage of variance explained as a function of the number of principal components.
  • Draw the diagram of explained variance (= scree plot). What do we observe? Use tools / plot / barplot, and enter:
    • dataset -> pca explained variance (%) : xCal(SNV(pir.csv))
    • label for x axis -> main component number
    • label for y axis -> percentage variance
  • Draw planes 1-2 and 3-4 of the factorial map of the individuals (score plot). The scores are in the history: pca scores: xCal(SNV(pir.csv)). To plot the factorial planes, use tools / plot / scatter plot and enter the following options:
    • label for x axis -> pc 1 | label for y axis -> pc 2 | series/plot type -> points
    • series/plot type/x-dataset -> pca scores :xCal(SNV(pir.csv))
    • series/plot type/column for x axis -> c2 : pc1
    • series/plot type/y-dataset -> pca scores :xCal(SNV(pir.csv))
    • series/plot type/column for y axis -> c3 : pc2
    • series/plot type/use first column as sample label -> yes
    • series/plot type/use a column of a dataset as point color -> yes
    • series/plot type/use a column of a dataset as point color/dataset -> yCal(ags.csv)
    • series/plot type/use a column of a dataset as point color/column for color -> c17 : origine
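
The SNV (Standard Normal Variate) pre-treatment applied at the start of this exercise is commonly defined per spectrum: subtract the spectrum's own mean and divide by its own standard deviation. The sketch below follows that common definition with the sample standard deviation; whether the ChemFlow tool uses exactly this variant is an assumption, and the values are hypothetical.

```python
from statistics import fmean, stdev

def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own statistics."""
    m = fmean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in spectrum]

# Two hypothetical spectra with the same shape but a different offset and scale,
# as can happen with light scattering differences between samples.
raw = [0.2, 0.4, 0.6]
shifted_scaled = [2 * x + 1 for x in raw]

print([round(v, 6) for v in snv(raw)])
print([round(v, 6) for v in snv(shifted_scaled)])
```

Both spectra give the same result after SNV, which shows why the pre-treatment is applied before PCA: additive and multiplicative effects that differ from one spectrum to the next are removed, leaving the shape.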

Performing a PLS regression: exercise 3

  • Use PLS regression to build a calibration model from the calibration set (xCal(SNV(pir.csv)), yCal(ags.csv)). Use tools / Regression / PLSR, and enter:
    • Select X data -> xCal(SNV(pir.csv))
    • Select y data -> yCal(ags.csv)
    • Column of y data chosen for the calculation -> c2 :C.16
    • Algorithm choice -> NIPALS
    • cross-validation type -> k-bloc cross-validation
    • Number of blocs for cross-validation -> 4
    • Number of latent variables(LV) -> 20
    • Centering option -> yes
    • compute outliers statistics(T2,Q,yresiduals) -> no
  • View and plot the values of RMSEC-RMSECV versus the number of latent variables. You must use tools / plot / scatter plot and enter the following options:
    • label for x axis -> Number LV
    • label for y axis -> RMSEC-RMSECV
    • series/plot type -> Line and Points
    • series/plot type/Dataset -> NIPALS-PLS on (xCal(SNV(pir.csv)) ;yCal(ags.csv)(c2)) :rmsec-secv
    • series/plot type/column for x axis -> c1 :
    • series/plot type/column for y axis -> c2 : rmsec and c3 : rmsecv
    • series/plot type/use column name(s) as legend -> yes
  • Then apply this model to the validation set (xVal(SNV(pir.csv)), yVal(ags.csv)). Use tools / Regression / Apply a regression model, and enter:
    • Select the regression model -> NIPALS-PLS on (xCal(SNV(pir.csv)) ;yCal(ags.csv)(c2)) :model
    • Number of latent variables if PLS or PCR model are used -> 20
    • Select X data -> xVal(SNV(pir.csv))
    • have you reference data of X data -> yes
    • Dataset containing the reference value of X data -> yVal(ags.csv)
    • Column of Dataset chosen for the calculation -> c2 :C.16
  • View and plot the RMSEP as a function of the number of latent variables. You must use tools / plot / scatter plot and enter the following options:
    • label for x axis -> Number LV
    • label for y axis -> RMSEP
    • series/plot type -> Line and Points
    • series/plot type/Dataset -> RMSEP of xVal(SNV(pir.csv)) from NIPALS-PLS on (xCal(SNV(pir.csv)) ;yCal(ags.csv)(c2)) :model
    • series/plot type/column for x axis -> c1 :
    • series/plot type/column for y axis -> c2 : rmsep
  • Plot the values predicted by PLS with the optimal number of latent variables against the observed values. The predicted values are in the history: ypred of xVal(SNV(pir.csv)) from NIPALS-PLS on (xCal(SNV(pir.csv)) ;yCal(ags.csv)(c2)) :model. You must use tools / plot / scatter plot and enter the following options:
    • label for x axis -> yref
    • label for y axis -> ypred
    • series/plot type -> points
    • series/plot type/x-dataset -> yVal(ags.csv)
    • series/plot type/column for x axis -> c2 : C.16
    • series/plot type/y-dataset -> ypred of xVal(SNV(pir.csv)) from NIPALS-PLS on (xCal(SNV(pir.csv)) ;yCal(ags.csv)(c2)) :model
    • series/plot type/column for y axis -> cn : LVn-1
    • series/plot type/use first column as sample label -> yes
    • series/plot type/add bissectrice and statistics parameters -> yes
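
The quantities plotted in this exercise (RMSEC, RMSECV, RMSEP) are all root mean square errors, computed on different sets of samples. The sketch below shows that formula, one possible way of cutting the calibration set into 4 blocks, and how an "optimal" number of latent variables can be read from an RMSECV curve. The contiguous block layout, the helper names and all numbers are assumptions for illustration; the document does not specify how ChemFlow builds its blocks.

```python
from math import sqrt

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted values."""
    return sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / len(y_ref))

def contiguous_blocks(n_samples, n_blocks):
    """Split indices 0..n_samples-1 into contiguous blocks of near-equal size."""
    base, extra = divmod(n_samples, n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def best_number_of_lv(rmsecv_by_lv):
    """Return the number of latent variables with the lowest RMSECV."""
    return min(rmsecv_by_lv, key=rmsecv_by_lv.get)

# Hypothetical reference and predicted values.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Each block is left out once while the model is fitted on the other three.
blocks = contiguous_blocks(10, 4)

# Hypothetical RMSECV curve: it decreases, then rises again with over-fitting.
chosen_lv = best_number_of_lv({1: 0.90, 2: 0.50, 3: 0.40, 4: 0.45})

print(round(error, 4), blocks, chosen_lv)
```

Taking the minimum of the RMSECV curve is only one convention; in practice a smaller number of latent variables with a nearly identical error is often preferred.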

Turning your history into a workflow: exercise 4

In ChemFlow, click on the cogwheel from the history frame at the top right and click on extract workflow.

Conclusion

Along with this guide, we advise you to:

  • follow the video tutorials dedicated to ChemFlow
  • follow the “Interactive Tours” of Galaxy to better understand the software:
  1. Go to the Help tab, then to Interactive Tours
  2. Choose one of the two tutorials, History Introduction or Galaxy UI.

Launch it by clicking on it and let yourself be guided.

Good luck and have fun on ChemFlow!