From: ILPI <info**At_Symbol_Here**ILPI.COM>
Subject: [DCHAS-L] The rebirth of the DCHAS-L archives
Date: January 16, 2012 10:59:41 PM EST
Reply-To: DCHAS-L <DCHAS-L**At_Symbol_Here**MED.CORNELL.EDU>
Message-ID: <6AA213D3-306F-4831-A7A2-50BD7335541A**At_Symbol_Here**ilpi.com>


The DCHAS-L archives from the present back to 2003 are now available through the DCHAS home page, http://dchas.org/?device=desktop (desktop) or http://www.dchas.org/ (mobile version).  The archives are actually hosted on my company's server at http://www.ilpi.com/dchas/ because of the logistics and overhead involved in operating them; you can access them from either site, of course.


Some of you may not be aware that the archives went dark when they were moved from UVM to Cornell.  Further, the Cornell listserv does not make archived posts publicly accessible.  Therefore, we needed to find a solution that would ensure that this compendium of safety knowledge remains available to all and is not software-dependent.

I doubt anyone can appreciate what a monumental challenge this was, so I'd like to give you a feel for it...

Ralph Stuart provided me with the monthly archive files.  These are long documents that are tedious to scroll through.  Moreover, some posts are in pre-formatted text, others in HTML, others in Microsoft's idea of HTML (which has so much spinach it's not funny), and still others are base64-encoded, which is not human-readable.  And remember, these are all jumbled together in one document.  Overall, it would be pointless to post these as is.

I wrote some programs to extract the individual posts into their own separate files, decode the base64 elements, pick out the text or HTML elements, and extract the header information such as Date, Subject, From, etc.  In all, there ended up being 6,147 files for the period 2003 to date, taking up 58 MB of disk space.
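
As a rough illustration of that extraction step, here is a minimal Python sketch built on the standard library's email package.  It is not the program used for the archive.  The function name summarize_post and the sample message are made up for the example, and it assumes the monthly file has already been split into one raw message per string.

```python
import base64
from email import message_from_string
from email.policy import default


def summarize_post(raw_message):
    """Return the key headers and the readable body of one raw post.

    Hypothetical helper: assumes raw_message holds exactly one message.
    """
    msg = message_from_string(raw_message, policy=default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    # get_content() undoes base64 or quoted-printable transfer encoding.
    body = body_part.get_content() if body_part is not None else ""
    return {
        "date": str(msg["Date"]),
        "subject": str(msg["Subject"]),
        "from": str(msg["From"]),
        "body": body,
    }


# Made-up sample post whose body is base64-encoded.
encoded = base64.b64encode(b"Goggles are not optional.\n").decode("ascii")
sample = (
    "From: someone@example.org\n"
    "Subject: Eye protection\n"
    "Date: Mon, 16 Jan 2012 22:59:41 -0500\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded + "\n"
)
post = summarize_post(sample)
print(post["subject"], "|", post["body"].strip())
```

Splitting the monthly files into individual messages is left out because the separator format of those files is not described here.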

I then wrote some more programs to add Previous/Next information to each post and to generate the yearly indices and then applied various formatting tricks.  I then had to hand-tweak a few hundred files because the archives are full of artifacts that interfere with automated processing.
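
The Previous/Next and yearly-index step could be sketched along these lines.  This is only an illustration under assumptions: the dictionary layout (file, date, subject), the file names, and both function names are hypothetical, and dates are assumed to be ISO strings so that sorting them as text gives chronological order.

```python
from collections import defaultdict


def add_navigation(posts):
    """Attach 'prev' and 'next' file names to posts sorted by date.

    Hypothetical layout: each post is a dict with 'file', 'date'
    (ISO string such as '2004-03-02') and 'subject' keys.
    """
    ordered = sorted(posts, key=lambda p: p["date"])
    for i, post in enumerate(ordered):
        post["prev"] = ordered[i - 1]["file"] if i > 0 else None
        post["next"] = ordered[i + 1]["file"] if i + 1 < len(ordered) else None
    return ordered


def yearly_index(posts):
    """Group (file, subject) pairs by the year taken from the ISO date."""
    index = defaultdict(list)
    for post in posts:
        index[post["date"][:4]].append((post["file"], post["subject"]))
    return dict(index)


# Made-up sample data, deliberately out of order.
posts = [
    {"file": "b.html", "date": "2004-03-02", "subject": "Fume hoods"},
    {"file": "a.html", "date": "2003-11-20", "subject": "Glove choice"},
    {"file": "c.html", "date": "2004-07-15", "subject": "Re: Fume hoods"},
]
ordered = add_navigation(posts)
index = yearly_index(ordered)
```

The first and last posts get None for the missing neighbor, which a page template can use to suppress the corresponding link.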

To make the archives useful, we needed a search capability which my company has built into its web server, so I posted them on ilpi.com where we could generate the search index and serve search queries.

Overall, it works great.  Some features I'd like to point out are:

1. All email addresses are cloaked to prevent web-scraper robots from harvesting your email address.  The "@" symbol has been replaced with "**At_Symbol_Here**" so humans can still read and use the email address.

2. Posts that are older than 5 years are automatically flagged with a cautionary note that the information in the post may be outdated.

3. It's wicked fast.  You can page between posts lickety-split.

4. There are still stray display artifacts and occasional extraneous characters (I did not review all 6,147 files, but I did look over a fairly large sample) but for the most part all of the posts should be human-readable.  If anyone finds one that is horridly run together, blank, or gibberish, email the URL to me or Ralph off-list so I can repair it.

5. I did not create threading (which lets you see the replies specific to each post), primarily because the thread data was missing from most of the posts.  I would have had to try and fake it based on the Subject lines, and that is just not realistic.

6. While this is invisible to the user, the site navigational elements, search bar, footers, and most of the text formatting options actually reside in single master files, which will allow us to change any one of those elements in one location and have it applied across the entire site.  Very handy, that one.
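
Features 1 and 2 above lend themselves to a short Python sketch.  It is an illustration only, not the code behind the archive: the address pattern is deliberately simplified, and the function names cloak_addresses and is_outdated are made up for the example.

```python
import re
from datetime import date

# Simplified address pattern for illustration; real addresses vary more.
ADDRESS = re.compile(r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)")


def cloak_addresses(text):
    """Replace the @ in anything that looks like an email address."""
    return ADDRESS.sub(r"\1**At_Symbol_Here**\2", text)


def is_outdated(post_date, today, years=5):
    """Return True if post_date is more than `years` years before today."""
    try:
        cutoff = today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - years, day=28)
    return post_date < cutoff


cloaked = cloak_addresses("Write to someone@example.org.")
old = is_outdated(date(2006, 12, 31), today=date(2012, 1, 16))
recent = is_outdated(date(2010, 5, 1), today=date(2012, 1, 16))
```

Checking the age when the page is served, rather than when the file is generated, would let posts pick up the cautionary note on their own as they pass the five-year mark.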

Moving forward, we should see some nice improvements to the archive as I will be capturing the complete raw output from the list server instead of the truncated information we had in the old archive files.  Once I write some more scripts we should be able to have:

1. Perfectly preserved formatting without artifacts - all fonts, colors, indents, etc. should show up as they originally appeared.

2. Subject threading as I will be capturing the thread data as the posts appear.  It may take me a while to get this operational; we'll see.

3. Posts appearing in the archive in real time - or at the end of the day at the latest.
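
For item 2, one common way to thread posts is to follow the standard Message-ID and In-Reply-To mail headers.  The Python sketch below assumes the raw list server output carries both headers and that posts are processed in arrival order; the dictionary keys and the function name build_threads are hypothetical.

```python
from collections import defaultdict


def build_threads(posts):
    """Map each thread's root Message-ID to the IDs of posts in that thread.

    Hypothetical input: dicts with 'id' and 'in_reply_to' (None for a new
    topic), listed in arrival order so parents come before their replies.
    """
    root_of = {}
    threads = defaultdict(list)
    for post in posts:
        parent = post["in_reply_to"]
        # A reply joins its parent's thread; anything else starts a new one.
        root = root_of.get(parent, post["id"])
        root_of[post["id"]] = root
        threads[root].append(post["id"])
    return dict(threads)


# Made-up sample: one three-post thread and one stand-alone post.
posts = [
    {"id": "<a@example.org>", "in_reply_to": None},
    {"id": "<b@example.org>", "in_reply_to": "<a@example.org>"},
    {"id": "<c@example.org>", "in_reply_to": "<b@example.org>"},
    {"id": "<d@example.org>", "in_reply_to": None},
]
threads = build_threads(posts)
```

A reply whose parent is not in the archive simply starts its own thread here, which avoids guessing from Subject lines.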

My hands are sore from days of coding and tweaking, and my arm is sore from patting myself on the back (grin).

I welcome any and all constructive criticisms/feedback.  Thanks again to Ralph for making the data available and for his thoughts/comments on the design/interface.

Rob Toreki

  ======================================================
Safety Emporium - Lab & Safety Supplies featuring brand names
you know and trust.  Visit us at http://www.SafetyEmporium.com
esales**At_Symbol_Here**safetyemporium.com  or toll-free: (866) 326-5412
Fax: (856) 553-6154, PO Box 1003, Blackwood, NJ 08012




The content of this page reflects the personal opinion(s) of the author(s) only, not the American Chemical Society, ILPI, Safety Emporium, or any other party. Use of any information on this page is at the reader's own risk. Unauthorized reproduction of these materials is prohibited. Send questions/comments about the archive to secretary@dchas.org.
The maintenance and hosting of the DCHAS-L archive is provided through the generous support of Safety Emporium.