Do we need a historian? It's pruning time.

Okay...maybe we don't need to back up the Cove...

I have built and run lots of websites...I would be more than happy to help with this if Spark could give me the data files. I think some of us could contribute something to the library building fund and maintenance thereafter. It'd just be a data dump...a reading room only and would *not* compete with BF.

Even if we don't run it on the fastest server...we'd understand that it takes time to research the good stuff.

.
 
Besides having it mirrored on a web site, which is a good idea, wouldn't a simple solution be to burn the files to a DVD? I believe a DVD can hold 4.7 GB? I don't have a DVD burner on my computer, but I'm sure a lot of folks do. Maybe Sparks can help with this, since he's going to have to access the files to delete them anyway.
 
I like Nasty's idea the best. Just purchase the license for our own vBulletin site, plus the web hosting of course. It's just a matter of money: $250 up front, plus $100 a year. That sucks. This whole situation sucks.

BTW, do NOT go to thecantina DOT com to check if it's available. The site contains porno content and I know some of you guys read this at work or around kids.

- D
 
Right now I'm working on a little script that anybody here can download and use to save to their computer arbitrary numbers of old threads from any forum on BF. These files could be packaged up and distributed relatively easily. I would be able to host them on an ftp server once I get back to school.

I might be able to finish it tonight, but it might take longer since I'm using a relatively unfamiliar language and have to learn how to do some new stuff. That's not a bad thing anyway, since I need the practice.
 
Lots of GREEN Rep points for you if you can get it done KM...LOTS

.
 
+3 Spirit for KM. That's +6 total.

I'd think that if they were saved in plain text (without all the graphics and avatars and other extraneous BS that come with web-based forums) a few years' worth of archives would be very small indeed...probably small enough to fit on a DVD, as Olpappy said.

We'll see what we see.
 
Done...I sure hope he sees it and can help..

.
 
Congrats guys!!!!:cool: :D Y'all bought some time for us anyway.:D :thumbup:

A friend of mine put some info from rec:nativeamerican in a ZIP file for me way back when I was on a WebTV unit, if you find that info helpful. If it can't be downloaded onto a CD, perhaps it can be downloaded into a ZIP until it can be transferred to a CD?
I know nothing about what y'all are trying to physically do.
 
I'm almost done with my script. I think it may be working but I'm testing it now to make sure. At least it is grabbing all the thread numbers and pages properly.

It looks like I might even be able to run it on my dial-up connection; it might take all night, but it would work.

GAAAH why does Windows have to be stupid and use BACKSLASHES in directory paths.....???? unlike normal operating systems which use forward slashes. Everyone knows that the backslash is supposed to be an escape character...
 
I have a kind-of-working version, but I'm still working on making it user-friendly so people can use it. There are also some problems: you can't see any pictures or use any links in the downloaded pages. It may also be slow and inefficient, but that might just be my connection speed.

For instance, here's what one downloaded thread looks like:
http://www-personal.umich.edu/~robgaunt/threadsaver/190935_1.html

here's the script:
http://www-personal.umich.edu/~robgaunt/threadsaver/savethread_0.1.py
you will need Python (www.python.org) to run it, and you need to go into the text file and change some values to be able to choose what you want to download. So at this point I don't recommend downloading it unless you know what you are doing and have some programming experience or want to make fun of my poor programming.

I may also make some sort of filter which modifies the downloaded files so that links to other threads work, pictures show up, etc. It's not as bad as it sounds, since all that is required is changing some values so the links point at the remote server instead of the same directory the HTML file is stored in.
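To give a rough idea of what that filter would do (this is only a sketch, not the actual savethread code; the base URL and file names are just examples, and it assumes the saved pages use plain relative links), something like this would rewrite the links in a saved page so they point back at bladeforums:

# Sketch of the link/image fix-up idea: rewrite relative href/src values
# in a saved page so they point back at bladeforums.com instead of the
# local folder. Assumes plain relative links; absolute paths would need
# slightly smarter handling.
import re

BASE = "http://www.bladeforums.com/forums/"

def absolutize(html):
    # Match href="..." or src="..." values that are not already full URLs.
    pattern = re.compile(r'(href|src)="(?!https?://)([^"]+)"', re.IGNORECASE)
    return pattern.sub(lambda m: '%s="%s%s"' % (m.group(1), BASE, m.group(2)), html)

if __name__ == "__main__":
    with open("190935_1.html") as f:
        page = f.read()
    with open("190935_1_fixed.html", "w") as f:
        f.write(absolutize(page))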

It might be worthwhile for more people to try the webcrawler since it would be superior to this method. I will test out the low bandwidth version myself and see if I can give ferguson some help in accessing the files. I suppose I should have done that before trying to hack some program together.

edit at 4:49 pm: Here's a progress report. Everything seems to be working OK. I've downloaded the posts up until 3-21-2005 and it took about 2 hours. That equates to about a month's worth of posts. So to get the remaining 6 months it will take me about 12 hours, or a couple nights of downloading. Not too bad though.

There was a bug in the original script which caused it to miss downloading the last thread on any given index page. That's fixed now, but if anybody tried using the old version, you will have missed a few threads.
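To illustrate the kind of mistake (simplified, not the actual code from the script):

# Hypothetical illustration of an off-by-one that drops the last thread
# on an index page -- not the real bug, just the usual suspect.
thread_ids = ["190935", "190936", "190937"]

# Buggy: range(len(...) - 1) stops one short, so "190937" is never fetched.
for i in range(len(thread_ids) - 1):
    print("downloading thread %s" % thread_ids[i])

# Fixed: iterate over the whole list.
for tid in thread_ids:
    print("downloading thread %s" % tid)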

I'm going to try to go through the downloaded files and filter out all the extraneous stuff put in there by the forum software. What will be left will be the poster names, dates, and actual content, along with any formatting that was inherent in the post. This should compress them significantly.
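The stripping step would look something roughly like this (just a sketch of the approach; real vBulletin pages need smarter rules to keep poster names and dates attached to each post):

# Rough sketch of the "strip the forum junk" idea: throw away scripts,
# styles, and tags, collapse whitespace, and keep only the readable text.
import re

def strip_forum_markup(html):
    html = re.sub(r'(?is)<(script|style).*?</\1>', ' ', html)   # drop scripts/css
    html = re.sub(r'<[^>]+>', ' ', html)                         # drop remaining tags
    html = html.replace('&nbsp;', ' ')                           # common entity
    return re.sub(r'\s+', ' ', html).strip()                     # collapse whitespace

if __name__ == "__main__":
    with open("190935_1.html") as f:
        print(strip_forum_markup(f.read()))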

Then I'll make some sort of search utility (for those of you who can't grep) and hopefully be able to host this mini-archive so anybody can download it. However most of this will have to be later when I'm back at school. Right now I'm just trying to save all the files, although I still have some hope that Spark won't go and delete them.

It's an interesting project anyways. I always wanted an offline searchable bladeforums database because the search utility on the forums is so horrible.
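The search utility could be as simple as something like this (a toy sketch; the folder and search term are only examples):

# Toy "grep for the archive" sketch: search every saved file in a folder
# for a word and print the matching file names.
import os, sys

def search(folder, term):
    term = term.lower()
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="ignore") as f:
            if term in f.read().lower():
                print(name)

if __name__ == "__main__":
    # usage: python search_archive.py saved_threads khukuri
    search(sys.argv[1], sys.argv[2])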
 
I messed around a little bit with QuadSucker and it isn't too complicated. In summary: it works if you set the proper link depth and start page, but it has some serious problems and won't work well for this.

The main problem with it is that it downloads a lot of junk. For instance, it won't just download the threads; it will also download multiple versions of the index page (one sorted by user name, one sorted by last post, etc.) and multiple versions of the same thread, so there will be a LOT of bloat and mess.

Here's how I used the program to save some files.

You want to start with a link that indexes all the oldest threads on a given site. So you could use this one:
http://www.bladeforums.com/forums/f...=1&pp=25&sort=lastpost&order=asc&daysprune=-1

or this one, if you want to index more threads at a time (recommended):
http://www.bladeforums.com/forums/f...1&pp=200&sort=lastpost&order=asc&daysprune=-1

Apparently you can't show more than 200 threads on a single page.

Now you will need to go into the settings (under the menu Settings, click Configuration Options).

Under Directory Structure, use "Mirror the Website" (should be default anyway)

Under Download Directory put whatever you want.

Under Spidering, I unchecked the "off-site images" and "off-site pages" options. This ought to reduce the volume of what you are downloading.

Now set the "link depth" to 1. This is so you only download threads on the HI forum and not from every single forum. Specifically, setting the link depth to 1 forces QuadSucker to download only what is linked directly from the index page.

Under "Link Relativiser" I think you need to make sure that "Relativize on-site links" is checked. Otherwise the links on the page will not refer to what is on your hard drive, but rather, what is on the Internet. This appears to be the problem that ferguson had.

On the other hand it was working fine for me without relativized links. I think this might be due to the directory structure that is used.

Now, ferguson, you still have the important data on your hard drive, but it is just a bit disorganized. It should still be possible to find any given thread. If you look at the list of files that QuadSucker downloaded, there are probably a bunch of files with names like: "showthread.php_t=198282&page=1". This is the actual content, but the problem is separating, searching, sorting, and distributing this content. I'll have to think on this a little more.
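For example, something along these lines could group those QuadSucker files by thread number (a sketch only; the folder name is made up, the file-name pattern is from the example above):

# Sketch of one way to sort out the QuadSucker download: group the saved
# files by thread number using the "t=" value in the file name.
import os, re
from collections import defaultdict

thread_re = re.compile(r't=(\d+)')
by_thread = defaultdict(list)

for name in os.listdir("quadsucker_download"):
    m = thread_re.search(name)
    if m and "showthread" in name:
        by_thread[m.group(1)].append(name)

for tid, files in sorted(by_thread.items()):
    print("thread %s: %d file(s)" % (tid, len(files)))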

Back to the tutorial.

Close the configuration window. Go to the Settings menu again, and click on Priority Keywords. Add the keyword "showthread.php". This makes Quadsucker realize that the threads are the most important content on the page. It will download threads before anything else.

Now when you click "go" the program should start downloading all the threads on the index page you gave it. When it is done you will have to give it another index page. If you put 200 threads per page there are at most 50-some pages that you will have to download to get the entire contents of the HI forum.

But like I said, there will be a great deal of mess and bloat. After testing it out, this program seems less than ideal for this specific application. I'll see if I can find something else.

Here's a list of various ones which may or may not work with Windows:
http://www.manageability.org/blog/stuff/open-source-web-crawlers-java/view
 
Anyone know anything about HTTrack Website Copier?

http://www.httrack.com/

"HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

WinHTTrack is the Windows 9x/NT/2000/XP release of HTTrack, and WebHTTrack the Linux/Unix/BSD release."



~
~~~~~~~~~
<> THEY call me
'Dean' :)-fYI-fWiW-iIRC-JMO-M2C-YMMV-TiA-YW-GL-HH-HBd-IBSCUtWS-theWotBGUaDUaDUaD
<> Tips <> Baha'i Prayers Links --A--T--H--D
 
If anyone feels like checking out these mirroring/site-ripping/offline-browsing programs...


http://www.internet-soft.com/extractor.htm
http://www.spidersoft.com/webzip/de....html?f1=Banner.html&f2=BlackWidow/index.html
& new beta version: http://www.softbytelabs.com/Frames.html?f1=Banner.html&f2=BlackWidow/index.html
http://www.5star-shareware.com/Windows/Internet/BrowsersOffline/BrowsersOffline1.html
http://www.infoclub.com.np/download/intoff.htm


 
I've been downloading the site with HTTrack for the past 7 hours and 4 minutes now and it seems to be working. I have to leave for work soon, but will let you know at the end of the work day if it has done what we hope for.

1.2 gigs so far...it seems some pages load properly from my local drive right now, but others go back out to the web for the data. Perhaps this is because it isn't finished yet, but I don't know.

We'll have to wait and see...then there will be the issue of getting it onto a disk.

.
 
Any webspider that you try has to have a feature which lets you restrict what it is trying to download. Otherwise it won't work well. The problems lie in the way that the forum software works... the pages aren't sitting static on some hard drive but are dynamically generated by a request.

So a webspider that doesn't let you filter what it downloads is going to download all sorts of junk. Think of all the sorts of ways that you can sort any given page on BF... you can display 25 posts per page, or 26, or 27, etc. and you can sort them ascending, descending, by last post date, etc. When you combine all that you will get millions of possible combinations and the webspider will end up downloading millions of pages with the same (to humans, at least) content.
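One way around that is to boil every thread link down to just the values that matter, the thread number and the page, so all the sort/display variants collapse to one key. A rough sketch (modern Python; the URLs are just illustrative):

# Sketch of the "same thread, million different URLs" problem: reduce each
# showthread.php link to its thread number and page so duplicates
# (different sort orders, posts-per-page, etc.) collapse to one key.
from urllib.parse import urlparse, parse_qs

def canonical(url):
    parts = urlparse(url)
    if "showthread.php" not in parts.path:
        return None
    qs = parse_qs(parts.query)
    tid = qs.get("t", ["?"])[0]
    page = qs.get("page", ["1"])[0]
    return (tid, page)

seen = set()
for url in ("http://www.bladeforums.com/forums/showthread.php?t=198282&page=1",
            "http://www.bladeforums.com/forums/showthread.php?t=198282&page=1&pp=40"):
    key = canonical(url)
    if key not in seen:
        seen.add(key)
        print("would download %s" % url)
    else:
        print("skipping duplicate %s" % url)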

1.5 GB worth of downloaded pages might represent all of the HI forum. Or it might be only a few hundred real threads. It all depends how much junk was downloaded.

What you need to find is a webspider with regular expression matching. This will let you match the particular "signature" that a link to a thread has. I used regular expressions in the program I wrote to pull the proper links off of the bladeforums pages.
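For example (this isn't the exact expression from my script, just the general idea), a pattern like this picks the thread links off an index page and ignores the rest:

# Sketch of the regular-expression idea: match only links that have the
# "signature" of a thread (showthread.php with a numeric t= parameter).
import re

link_re = re.compile(r'href="(showthread\.php\?[^"]*\bt=\d+[^"]*)"')

index_html = '''
<a href="showthread.php?t=190935">Good thread</a>
<a href="forumdisplay.php?f=46&sort=username">junk</a>
<a href="showthread.php?s=abc123&t=198282&page=2">Another thread</a>
'''

for link in link_re.findall(index_html):
    print(link)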

Right now I've downloaded all old posts up to 5-27-2002. This is about 100 MB and 1,100 files. I would have downloaded all the threads older than 3 years but at some point last night my dad shut off my computer without me knowing and it reset the progress. I should have them all downloaded tonight though.

I have also made a utility which strips the unnecessary forum code out of the downloaded files and makes them a lot smaller and more compact.
 
A couple of things I thought of while reading the HTTrack documentation:

You can filter what you want to download. In our case, we only want to download threads, so a filter like: "http://www.bladeforums.com/forums/showthread.php?t=*" will work well to cut down on junk.
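To show what that wildcard keeps and throws away, here's a quick check done in Python with fnmatch (HTTrack's own filter syntax may differ slightly; the URLs are just examples):

# Quick check of the wildcard filter idea. Note that fnmatch treats '?' as
# a one-character wildcard, which here happens to match the literal '?' in
# the URL, so the pattern still behaves as intended for this demo.
from fnmatch import fnmatch

pattern = "http://www.bladeforums.com/forums/showthread.php?t=*"

urls = [
    "http://www.bladeforums.com/forums/showthread.php?t=198282",
    "http://www.bladeforums.com/forums/forumdisplay.php?f=46&sort=username",
    "http://www.bladeforums.com/forums/member.php?u=1234",
]

for url in urls:
    print("keep" if fnmatch(url, pattern) else "skip", url)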

If you start the crawl on the first thread in the forum, and then have the program keep on crawling to the "next thread" (note the link at the bottom of each thread you view), it will eventually reach them all, and probably not be downloading any junk if the filters are set properly.

I'll download the program myself and see if I can get a combination of settings and filters which will work well.
 