15 Replies Latest reply on Dec 8, 2010 4:28 PM by Geoffrey Hynes

    Successful importation of datasets

    New User

      Hi Bob,

       

      Thank you again for the importation manual. I have now successfully imported small sections of PhyChem/Biodeg data from my database and realised where I was making simple but very important mistakes. It may be just me, but I thought I'd share this information, as it made the difference between a successful import and no import at all.

       

      When selecting DATA, you need to make sure that you define the values. These are the DATA-Mean Value Scale/va (could also be DATA-Low Value) and then select for DATA-Unit.

      I've attached a MS Word document to show this.

       

      In addition, I've also imported using the complete string:

      Environmental Fate and Transport#Biodegradation#Biodegradation in water: screening tests#% Degradation

       

      and by separating out the string:

       

      Environmental Fate and Transport | Biodegradation | Biodegradation in water: screening tests | % Degradation

       

      Both methods have imported the database correctly.
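The two layouts above carry the same information. As a minimal sketch, assuming "#" is the only separator and never occurs inside a segment name, the single-cell string can be converted to one segment per cell and back. The function names are hypothetical and are not part of the Toolbox.

```python
from typing import List


def split_endpoint_path(path: str) -> List[str]:
    """Split a '#'-delimited endpoint path into one segment per spreadsheet cell."""
    return [segment.strip() for segment in path.split("#")]


def join_endpoint_path(segments: List[str]) -> str:
    """Rebuild the single-cell form from the per-cell segments."""
    return "#".join(segment.strip() for segment in segments)


full_path = (
    "Environmental Fate and Transport#Biodegradation"
    "#Biodegradation in water: screening tests#% Degradation"
)
cells = split_endpoint_path(full_path)
```

Here `cells` holds the four segments in order, and `join_endpoint_path(cells)` gives back the original string.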

       

      In addition, there are quite a few changes in the terminology for assays, so for a 550 database set this will take some time to correct and update.

       

      e.g.

      TB Version 2:

      Human health hazards#Genetic Toxicity#in vitro#bacterial reverse mutation assay (e.g. Ames test)#Gene mutation#Salmonella typhimurium#without S9#TA 98

       

      TB Version 1:

      toxicoloical Information#Genetic Toxicology (mutation and chromosome aberrations)#In Vitro#Ames_Mutagenicity#Ames Mutagenicity without S9#rat#AMES-Salmonella Typhimurium TA 98

       

      I hope this helps.

        • Re: Successful importation of datasets
          83059 Expert

          Excellent. Good to see that it works. Will add your clarifications to the manual.

          thanks

          Bob

            • Re: Successful importation of datasets
              New User

              Bob/Geoff,

              I think there are still some issues on this one.

              In general it is relatively easy to get a dataset to import but getting it to exactly the correct endpoint path, especially for long paths, is still tricky. So for example, although Geoff was able to import using the entire string Environmental Fate and Transport#Biodegradation#Biodegradation in water: screening tests#% Degradation I have been unable to get similar results with, "Ecotoxicological Information#Aquatic Toxicity#Mortality#EC50#48 h#Animalia#Arthropoda(Invertebrates)#Branchiopoda(branchiopods)#Daphnia magna".

              I can get a successful import by just using the Defined portion (Ecotoxicological Information#Aquatic Toxicity), but have then been unable to define the rest of the fields using the metadata tags, partly because it is not clear which tags refer to which points on the endpoint path, and partly because of the seemingly inconsistent behaviour of the import process, as below.

              I have tried to follow the step-by-step instructions in the revised pdf for horizontal import of the Ecotoxicological example. When assigning metadata tags, for example Duration Unit as in the example on page 30 of the revised import wizard pdf, I cannot assign more than one column to the metadata tag "duration".

              In the example given in the pdf, the duration tag can have the value of "Mean value/Scale value" or "Unit" depending on the radio buttons selected. I cannot get these tags to appear when I select the appropriate button, so I can only define one field for duration. This means that when the data loads, it loads into an endpoint path that has "undefined duration" as part of its path. It may be something to do with the "Set tree hierarchy" feature mentioned in the pdf, but this does not appear to be very clearly explained.

              In any event it appears that this is a problem even within the databases included in the Toolbox. I have noted some data entries which have an "undefined" element in their endpoint path, when I suspect they should appear in a fully defined path. If this is the case there may be significant amounts of data which cannot be used for read across because their endpoint paths have not been defined properly or the data in the original spreadsheet is not in the exactly correct format. Import would still appear OK as the "Import Successful" message does not necessarily mean that import has been successful. I may be completely wrong about all of this but I'm struggling for explanations for my own failure to import the data into the "correct" endpoint path. I have attached an extract of the data I'm trying to import, so if anyone can give me some tips I'd be most grateful.

               

              Nick

                • Re: Successful importation of datasets
                  588921 Novice

                  Hi Nick,

                   

                  I've written a response. It has some pictures in it so I've attached it as a Doc file.

                   

                  Regards,

                  Georgi, LMC Team

                    • Re: Successful importation of datasets
                      New User

                      Hi Nick,

                       

                      There do still seem to be some fundamental issues with the database importation wizard, which I thought were linked to the exact matching of the tree path. However, after Nick's comments I have gone in and looked at a specific endpoint which I know has limited data.

                       

                      For cyclophosphamide (CAS 50-18-0), the mouse lymphoma test is classified under the following tree path and is undefined.

                      Human health hazards#Genetic Toxicity#in vitro#Undefined Test type#Gene mutation

                       

                      Whereas, for 2-aminoanthracene (CAS 613-13-8), the correct following tree path is supplied.

                      Human health hazards#Genetic Toxicity#in vitro#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells

                       

                       

                      However, using this correct tree path by copying it directly from the TB does not seem to mean that it will import correctly as expected.

                       

                      This procedure worked for:

                       

                      Environmental Fate and Transport#Biodegradation#Biodegradation in water: screening tests#% Degradation

                       

                      But not for:

                       

                      Human health hazards#Genetic Toxicity#in vitro#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells

                       

                      As this was literally copied directly from the TB, pasted into Excel and then imported straight back in, there seems to be an issue with the TB. So I’m now not sure if I’m any further forward.

                       

                       

                      Hi /Georgi,

                       

                      I will review your additional information and will hopefully be successful.

                       

                      Best Regards, Geoff...

                      • Re: Successful importation of datasets
                        New User

                        Georgi,

                        Thanks for the additional information. Can I take your points in turn:

                        1. The problem with Duration. I fully understand the significance of the difference between "define new Region" and "Metadata". However, your comment:

                        "Note that “Is value” has some particular behavior. It reacts when clicked but does not update properly when other column is set. As a rule of thumb when you select another column you should assume that it does not properly show its “is value” status and explicitly check/uncheck it."

                        does answer my point about assigning multiple columns to a metadata tag, though "particular behaviour" is a quaint way of putting it. Using your check/uncheck method works, thanks. However, this fix seems to fail if all compounds are removed from the active window and a new set is loaded. The program has to be restarted for the fix to work again.

                         

                        2. The endpoint tree. Again, I've read the wizard pdf and I'm aware of how the data fields are made up from Defined regions and metadata. My point was not which metadata fields are displayed in the Toolbox but rather how to get the data loaded into the "correct" endpoint path. For example, how do the metadata tags code for "Kingdom", "Phylum", etc.? If only the species metadata tag is used, the data will be displayed as "Unknown Kingdom", "Unknown Phylum", etc.

                        You say: "The Animalia#Arthropoda(Invertebrates)#Branchiopoda(branchiopods) part is a separate feature in which Kingdom#Phylum#Class information is inserted before the field Test organisms (species)."

                         

                        I'm not sure what this means. I've tried using the "Daphnia magna" part as my species metatag but, not surprisingly, the data ends up in an "undefined kingdom/undefined phylum/undefined class" path. I've also tried incorporating the kingdom/phylum/class columns within my "Species" metatag, but with the same result. Since you have used my small example data set in your reply, maybe you could tell me whether you successfully imported that data to the path Ecotoxicological Information#Aquatic Toxicity#Mortality#LC50#48 h#Animalia#Arthropoda(Invertebrates)#Branchiopoda(branchiopods)#Daphnia magna?

                        3. Your comment about "consistent visual experience" misses the point. If all data from the same test/species/duration/etc. from all databases are not defined and metatagged consistently, then when the user forms groups for read across he/she will not have access to all the available data, because some will contain one or more "undefined" fields, as indicated by Geoff in his recent post. This will diminish the value of the Toolbox as a predictive aid, since you need as much data in a category (group) as possible to improve the probability of a correct prediction. I suspect that quite a few datapoints from databases provided with the Toolbox are not properly assigned to their "correct" endpoint path.

                         

                        Nick

                          • Re: Successful importation of datasets
                            588921 Novice

                            Nick,

                             

                            1. I guess this is a bug. I will check this and make sure it is fixed in the next release.

                            2.

                            I've tried using the "Daphnia magna" part as my species metatag but,

                            You should tag the "Daphnia magna" column as "Test organisms (species)". The Toolbox engine will then fill in the Kingdom#Phylum#Class information.

                             

                            I did successfully import your example file with no problems. I've attached a screenshot with the designations I've used.

                             

                            3. You are right. Right now the Toolbox offers the flexibility to import any data to any metadata field. You could import Daphnia magna to a field called Duration for instance which will then look off when the dynamic tree is built. Additional restrictions might be in order but I do not have additional information at the moment.

                             

                            Georgi

                              • Re: Successful importation of datasets
                                New User

                                Georgi,

                                Many thanks for the rapid reply. SUCCESS at last.

                                I do wonder, though, whether it is unfortunate that the metatag label "Species" is not the one required to label the species column, and that "Test Organism (species)" is the correct one. I think a full list of the metadata tags and their use context would be very useful.

                                Thanks also for the comments regarding the implementation of metadata in the Toolbox itself. I'm sure it would be a monumental task to check it all but, for instance I have found some examples where Ames tests using S.typhimurium TA100 appear in the tree path as being "Undefined Test organisms (species)". Also 878 data points from the OASIS Genotox database appear under "Human health hazards#Genetic Toxicity#in vitro#in vitro mammalian chromosome aberration test#Chromosome aberration#Undefined Test organisms (species)#without S9", but according to the exported endpoint data appear to be from Chinese Hamster lung cells.
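One way to see how much data is affected is to scan exported endpoint paths for placeholder segments. The following is a rough sketch, assuming the paths are available as "#"-delimited strings and that placeholders always begin with "Undefined" or "Unknown", as in the examples in this thread. The helper names are hypothetical and this is not a Toolbox feature.

```python
from typing import List, Tuple

# Assumption: placeholder segments start with one of these words (case-insensitive).
PLACEHOLDER_PREFIXES = ("undefined", "unknown")


def undefined_segments(path: str) -> List[str]:
    """Return the segments of an endpoint path that look like placeholders."""
    return [
        segment.strip()
        for segment in path.split("#")
        if segment.strip().lower().startswith(PLACEHOLDER_PREFIXES)
    ]


def partition_paths(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (fully defined, containing a placeholder segment)."""
    complete = [p for p in paths if not undefined_segments(p)]
    incomplete = [p for p in paths if undefined_segments(p)]
    return complete, incomplete


exported_paths = [
    "Human health hazards#Genetic Toxicity#in vitro#in vitro mammalian chromosome "
    "aberration test#Chromosome aberration#Undefined Test organisms (species)#without S9",
    "Human health hazards#Genetic Toxicity#in vitro#mammalian cell gene mutation assay"
    "#Gene mutation#Mouse Lymphoma cells",
]
complete, incomplete = partition_paths(exported_paths)
```

With the two paths quoted in this thread, the chromosome aberration path lands in `incomplete` and the mouse lymphoma path in `complete`.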

                                • Re: Successful importation of datasets
                                  New User

                                  Hi Georgi,

                                   

                                  The complexity seems to have increased substantially when importing proprietary databases. This was very simple, but effective, in version 1 of the Toolbox.

                                   

                                  I understand your comments and hence Nick's success, although I haven't tried this yet myself

                                   

                                  However, can I ask why the tree path does not import correctly when it is copied from the Toolbox and pasted directly into Excel (i.e. Human health hazards#Genetic Toxicity#in vitro#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells)?

                                   

                                  Cheers,

                                  Geoff...

                                    • Re: Successful importation of datasets
                                      New User

                                      Geoff,

                                      Pre-empting Georgi's reply, I presume the answer is that only the fields "Human health hazards" and "Genetic Toxicity" are recognised as legitimate primary fields in the database, whereas the remainder (#in vitro#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells) is only recognised if defined by the metadata tags.

                                       

                                      Nick

                                        • Re: Successful importation of datasets
                                          New User

                                          Hi Nick,

                                           

                                          I agree, however I'm interested and wondered why the data in example 1 goes in correctly, but data in example 2 doesn't.

                                          From Georgi's comments, the metadata (highlighted tree path) should need to be defined for both.

                                           

                                          1). Environmental Fate and Transport#Biodegradation#Biodegradation in water: screening tests#% Degradation

                                           

                                          2). Human health hazards#Genetic Toxicity#in vitro#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells

                                           

                                          I'm assuming that you have now separated your database out into individual Excel cells instead of keeping it all in one long string in a single cell?

                                          If that's the case, I may revert to my original set-up and try this again selecting all the metadata.

                                           

                                          Cheers,

                                          Geoff...

                                            • Re: Successful importation of datasets
                                              588921 Novice

                                              Geoffrey,

                                               

                                              The import works on a leaf node of the predefined tree (the 1st path) and not on a dynamic path (the 2nd one). You can see which is which by pressing the Ctrl key: this underlines the predefined part of the tree (see attached file).

                                              If you want to see what defines the Dynamic part you can click on the Human Health Hazards#Genetic Toxicity and you will see what metadata fields are used to define the hierarchy.

                                               

                                              Georgi
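The predefined/dynamic distinction can be sketched as a longest-prefix match. This is only an illustration: the set of predefined leaf nodes below is an assumption pieced together from the paths discussed in this thread, not a list taken from the Toolbox, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: leaf nodes of the predefined tree, as inferred from this thread only.
PREDEFINED_LEAVES = {
    "Environmental Fate and Transport#Biodegradation"
    "#Biodegradation in water: screening tests#% Degradation",
    "Human health hazards#Genetic Toxicity",
    "Ecotoxicological Information#Aquatic Toxicity",
}


def split_predefined(path: str) -> Tuple[Optional[str], List[str]]:
    """Return (predefined leaf, dynamic segments) using the longest matching prefix."""
    segments = path.split("#")
    for end in range(len(segments), 0, -1):
        prefix = "#".join(segments[:end])
        if prefix in PREDEFINED_LEAVES:
            return prefix, segments[end:]
    return None, segments  # no predefined leaf recognised


biodeg = (
    "Environmental Fate and Transport#Biodegradation"
    "#Biodegradation in water: screening tests#% Degradation"
)
genotox = (
    "Human health hazards#Genetic Toxicity#in vitro"
    "#mammalian cell gene mutation assay#Gene mutation#Mouse Lymphoma cells"
)
biodeg_leaf, biodeg_dynamic = split_predefined(biodeg)
genotox_leaf, genotox_dynamic = split_predefined(genotox)
```

Under these assumptions the first path is entirely predefined and leaves no dynamic segments, while the second splits into the leaf "Human health hazards#Genetic Toxicity" plus four dynamic segments that would have to come from metadata columns.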

                                              • Re: Successful importation of datasets
                                                New User

                                                Hi Geoff,

                                                I've completely deleted the "Mortality#LC50#48 h#Animalia#Arthropoda(Invertebrates)#Branchiopoda(branchiopods)" columns from my spreadsheet. All that is needed is the column containing the predefined region, "Endpoint path" (in my case it's "Ecotoxicological Information#Aquatic Toxicity"), and the species column (Daphnia magna). So long as my species column is metatagged as "Test Organism (Species)", the Toolbox fills in the rest. Of course I still need the columns for duration, units etc.

                                                Nick
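As a rough sketch of the reduced layout described above, the snippet below writes one row with only the predefined endpoint path, the species and the duration columns. The column headers are illustrative assumptions (the metadata tag is assigned in the import wizard, not by the header text), and the measured value and its unit are left out rather than invented.

```python
import csv
import io
from typing import Dict, List

# Assumption: header names are illustrative; tags are chosen in the import wizard.
COLUMNS = ["Endpoint path", "Test organisms (species)", "Duration", "Duration unit"]

rows = [
    {
        "Endpoint path": "Ecotoxicological Information#Aquatic Toxicity",
        "Test organisms (species)": "Daphnia magna",
        "Duration": "48",
        "Duration unit": "h",
    }
]


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


sheet_text = to_csv(rows)
```

The resulting text can be saved as a .csv file and opened in Excel; the data value and unit columns would be added alongside these.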

                                                  • Re: Successful importation of datasets
                                                    New User

                                                    Hi Nick/Georgi,

                                                     

                                                    Success.

                                                    Using the Ctrl key to see the predefined tree path for each parameter helps, as per your previous emails (I'd forgotten about this). I then defined the dynamic tree path in individual cells, and as long as these match the additional information guide Georgi sent, everything links in nicely.

                                                    I still think this is overly complicated compared to version 1 of the TB, but it now works.

                                                     

                                                    I was beginning to think I'd need separate databases for the 4 predefined areas, but not now, which is a major bonus.

                                                     

                                                    After a lot of work, cheers all,

                                                    Geoff…