Text file duplicate line remover


John-
10-31-2003, 10:16 PM
well i need this program and i cant program for ****.
i wasnt gonna post and ask for you guys to try and make it but you are all saying about how this place is dying and ****, and nobody cares about the place so i figured i would post a topic which everybody seems to enjoy on here programming :)

anyway back to the point.

i need this program, and it's not like i haven't looked for it. in fact i have found many versions of it, come to think of it.

i have tried and tested all of them,
and in some way or another they do not work correctly.
either they are completely unaccurate (is that a word? should it be inaccurate? is that a word anyway?)
some programs will say a completely different amount of dupes to other programs.
or the program doesn't save the full text file after removing all the duplicates (i encountered this with 3 different programs, i think).
or it just crashes and does nothing.

those are the most common errors that occurred.
the errors occur especially with bigger text files, for example anything 2MB+. when the wordlist is loading it will usually take a while, it can even take like 45 seconds (i don't mind that).

i have been told by a few people this is an easy program to make, but because none of the different programs i have tried work absolutely 100%, i put this in the intermediate section.

i even emailed a programmer asking if he was able to make this kind of program.
the guy i emailed is a very well experienced programmer who has made many programs using text files in the past.

this is what i wrote to the programmer.

i was wondering if anytime in the near future you had any
plans to create a text file duplicate line remover program. there
are many programs which claim to do this that i have tried, but
they are either not accurate with removing duplicates, save the
file with lots of the list missing or the program seems to crash.
especially with large text files.

this is what he replied

I did something in the past in that way, but gave it up when things
kept going wrong. Although everything seemed easy at the start, more
and more barriers came up that were difficult to overcome/handle -at
least- in a reliable way. (Not all work was wasted, because I could
use parts of it to create TABS2spaces :-)

--
Best regards,
David.


i figured if an experienced programmer like him couldnt do it not many people could.
unless he isnt as experienced as i thought.

but there are lots of great programmers here so i thought give it a shot and ask.

anyway some of you might not give a crap but i figured some of you might.
some of you might even want to do it as a small team project just an idea anyway.

aUsTiN
10-31-2003, 11:43 PM
Well, You Can Try Catcher. It's A Simple Dup Remover I Made. As Far As Holding That Much Info In A Listbox, heh Good Luck!

+ txt Files Wont Hold That Much & Become .doc's

Then You Encounter Opening Them, Causes Lag, & If For Instance On My Computer, If The .doc Or .sql Is Over 9 Megs, Wordpad Nor Notepad Will Open It. So I Have To Turn To Dreamweaver Which Is By Far A RAM Hog.

Anyways Point being, Dup Removers Are Not The Hard Part, Storing That Much Data In A Listbox, Is.

baloney_mahoney
11-01-2003, 12:59 AM
Originally posted by aUsTiN
.........Anyways Point being, Dup Removers Are Not The Hard Part, Storing That Much Data In A Listbox, Is.

So, don't use a ListBox or a TextBox. There are other ways to store extremely large amounts of data in your program.

nscopex
11-01-2003, 04:09 AM
Are these single lines of text? If its a text file of

line1
line2
line3

That's not at all hard to make and I could whip you something up fast. But as for removing dupes in another way, you'll have to fill me in.

John-
11-01-2003, 09:55 AM
nscopex that is how it would be.
and austin is right man, opening large text files on my computer is difficult as well.
i tried catcher, it is the most accurate dupe remover i have come across so far, nothing beats it :). but i can only get it to save as 510KB no matter how big the text file is.

baloney_mahoney
11-01-2003, 12:42 PM
QUOTE:

"Are these single lines of text? If its a text file of

line1
line2
line3"


That is 3 lines of text no matter what or where they are and if you add 1 more......

line1
line2
line3
line4

.......then its 4 lines of text.

aUsTiN
11-01-2003, 08:58 PM
Well What Would You Suggest baloney?

You Say There Are Others,

Listbox
ComboBox
Rtb
Textbox

I Mean There Are Only So Many & Tellin Me There Are Other Doesn't Fix The Fact They Wont Hold Huge Amounts Of Data.

John-
11-01-2003, 09:54 PM
what about making the textbox a lot bigger, like 3 or 4 times bigger? maybe it would load easier because even the scrollbar wouldn't have to move as fast. bear in mind i am a noob so don't flame my lack of knowledge, just admire the part that i am using my brain to think lol.

i'm not sure if baloney meant ways of doing it without a text box, but i don't have a clue.

nscopex
11-02-2003, 01:30 AM
IM ASKING:

Is this a text file with names and password? Or names?

Does it go:

a
b
c
d
a
b
c
d

???

If it does, I can have it remove the second a, b, c, and d, so it looks like

a
b
c
d

No matter how big the text file. Not hard. Is that what you want it to do?

aUsTiN
11-02-2003, 08:22 AM
I'm Thinking Random Words / Proxies / W-e In The List.

a
b
d
g
e
h
v
b
e

Etc. & It Removes The Duplicates. The Dup Remover Is Not The Hard Part By Any Means. The List Size Is.

John-
11-02-2003, 08:26 AM
yes basically it is for password cracking using combo lists.

like

john:password
john:eggs
johns:sweets
john:dynamite
john:password

so for example it would remove the duplicate john:password because it is there 2 times.

aUsTiN
11-02-2003, 08:27 AM
Thats Simple. LoL

I'll See If I Can Get Huge Filesize Lists To Load / Save. Since Thats All Ya Really Need...

John-
11-02-2003, 09:10 AM
if you need test files i got some
i might make another post about a text file splitter lol.

i got a huge 57mb text file which is so big i cannot even open, i could probably change it to .doc to get it to open but it would take me forever to split the file into smaller files manually cos i can only copy and paste so much.

baloney_mahoney
11-02-2003, 11:08 AM
Originally posted by aUsTiN
Well What Would You Suggest baloney?

You Say There Are Others,

Listbox
ComboBox
Rtb
Textbox

I Mean There Are Only So Many & Tellin Me There Are Other Doesn't Fix The Fact They Wont Hold Huge Amounts Of Data.

The problem with Visual Basic programmers is that they tend to think only in terms of ListBoxes, TextBoxes, etc. These are fine for medium scale data manipulation.

The ListBox is great because it allows you to add records to it and, with its Sorted property turned on, VB does the sorting for you. Then, you just start from the bottom up removing all duplicates in the ListBox. Simple.

But what do you do when you want to deal with, let's say, 50MB, 60MB, or even 100MB of data? Well, you had better get VB out of your mind because it just isn't going to work.

When I said that there are other ways to store large amounts of data in a program I was not necessarily speaking in terms of using Visual Basic.

Think C. Use the 'malloc' function. You can get up to 2GB of memory. Now I am not going to tell you how to go about writing your dup-removing logic. I am only telling you how to get the memory space to do so. Of course, you will have to write your own sorting algorithm, but if you are serious about a program that will eliminate duplicates in a large scale data file then you had better learn something about C and the usage of 'malloc'. If you're not serious and you are content with small to medium scale data then VB and the ListBox is your best approach.

Also, you might want to look into the 'VirtualAlloc' API call that can be used in VB.

Here's a method that one of the programmers where I work used to perform duplicate removing.

He took each record from the input file and used its contents as the name of a file. He then created a 0 length file of that name in a directory. When the program tried to create another file of the same name the system rejected the request. So, the end result was a directory of records that contained no duplicates. He then read the directory entries back into his program and used the names of the files as the entries in his new output file.

I have no comment on whether this method is reasonable or not. I'm just saying that's how he did it and it worked.
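
The directory trick works because the file system refuses a second file with the same name. Below is a minimal sketch of the same idea kept in memory instead, assuming the Microsoft Scripting Runtime (which provides Scripting.Dictionary) is installed on the machine. The name 'RemoveDuplicateLines' and the way it skips blank lines are my own assumptions, not anything posted in this thread. It reads one line at a time and writes each line the first time it is seen, so no ListBox is involved, but every unique line is still held in memory. Note also that a record such as john:password could not be used as a Windows file name in the first place, because ':' is not allowed in file names.

```vb
' Hypothetical helper: copies SourcePath to DestPath, keeping only the
' first occurrence of each non-blank line. Returns the number of
' duplicate lines that were dropped.
Public Function RemoveDuplicateLines(ByVal SourcePath As String, _
                                     ByVal DestPath As String) As Long
    Dim Seen As Object          ' Scripting.Dictionary, late bound
    Dim InFile As Integer
    Dim OutFile As Integer
    Dim Record As String
    Dim Dropped As Long

    ' Keys are compared case-sensitively by default
    Set Seen = CreateObject("Scripting.Dictionary")

    InFile = FreeFile
    Open SourcePath For Input As #InFile
    OutFile = FreeFile
    Open DestPath For Output As #OutFile

    Do While Not EOF(InFile)
        Line Input #InFile, Record
        If Len(Record) > 0 Then
            If Seen.Exists(Record) Then
                Dropped = Dropped + 1       ' already written once
            Else
                Seen.Add Record, True       ' remember it
                Print #OutFile, Record      ' and keep it
            End If
        End If
    Loop

    Close #OutFile
    Close #InFile

    RemoveDuplicateLines = Dropped
End Function
```

It could be called as Debug.Print RemoveDuplicateLines("c:\myfile.txt", "c:\DuplicatesRemoved.txt"). Unlike a sorted list, this keeps the surviving lines in their original order.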

John-
11-03-2003, 01:37 PM
finally i have asked my friend ^1ST^, he is a good coder and a nice hacker, and he said he could code this very easily. if u wanna get some help from him personally, join his forum or his irc server,

irc.dynm8.net , and join #crack

also here is his forums address : www.gsmcracking.com



have fun ;)


REGARDS

baloney_mahoney
11-03-2003, 03:53 PM
Well I went there and it's too much b/s to type in and then when you have finished typing in all that they require it doesnt work so screw it.

John-
11-04-2003, 08:35 AM
i didnt make that post lol.
i asked him to look at this topic cos he can program. i thought he could lend a hand and he said he would. he didn't want to go to the trouble of signing up so i lent him my user and pass.

i thought he was gonna post some help on this site.

John-
11-05-2003, 02:25 PM
did this topic die?

i dont mind if you guys couldnt do it at least you tried and thats the main thing if i could program i wouldnt have to be a begging bum and ask you guys to do it lol.

baloney_mahoney
11-05-2003, 10:40 PM
Well, I posted my opinion about doing mass data processing in Visual Basic so take it or leave it.

I could do it because I know "C". I also know how to interface VB with "C" but I just don't have the time to mess around with it at this time. Someday in the future when I want to get into processing huge amounts of IDs and passwords I might put one together, but I'm not really into that kind of programming now.

John-
11-06-2003, 05:09 AM
i don't wanna start a flame war but i know why nscopex flamed you in another thread cos whenever there is a topic on programming you post the most useless **** lol.

firstly from what i have heard (even though people say you cant)
you can do practically anything in VB6 maybe you cant do everything but a program like this i am sure you could do.

telling us you are capable doesnt really help at all, this thread was meant as a way of maybe putting some ideas together to help each other figure out a way of handling the large amount of data with/without textboxs and the errorhandling.

but thanks anyway lol

John-
11-06-2003, 05:24 AM
sorry about my last post i was just
pissed off, didn't really mean anything by it.

John-
11-06-2003, 05:42 AM
just had a look on google and apparently lots of people have had this same problem with handling this large amount of data from text files.

im not quite sure how to fix it though because the answers people are giving might be understandable to you guys but are like someone is speaking german to me.

i was gonna ask if someone could paste a small sample source of how a text file line duplicate remover would work but i wouldnt understand it anyway i dont think.

anyway i am looking for solutions, so if you want me to paste some little articles or hints or links just say so and i will.

John-
11-06-2003, 06:22 AM
i was wondering 2 little small things.

if you try to open a file that doesnt exist like
Open "C:\example.txt"
would it automatically create this file or would it just not open anything.

second thing when you are using the
Close #1

does it automatically close and save or just close.

baloney_mahoney
11-06-2003, 09:40 AM
Originally posted by john123456
i was wondering 2 little small things.

if you try to open a file that doesnt exist like
Open "C:\example.txt"
would it automatically create this file or would it just not open anything.

second thing when you are using the
Close #1

does it automatically close and save or just close.

Example:

Open "c:\example.txt" For Input As #1

If the file does not exist you will get an error.

but.....

Open "c:\example.txt" For Output As #1

If the file does not exist it will be created. If it does exist it will be overwritten.
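
A minimal sketch of how the "For Input" error could be avoided, assuming you would rather check for the file first than trap the error: Dir$ returns an empty string when no file matches the path. 'FileExists' and 'OpenIfPresent' are names made up for this illustration.

```vb
' Hypothetical helper: True when Path names an existing file.
Public Function FileExists(ByVal Path As String) As Boolean
    If Len(Path) = 0 Then Exit Function     ' no path given, report False

    ' Dir$ can raise an error for a drive that is not ready, so an
    ' error here simply leaves the result at its default of False
    On Error Resume Next
    FileExists = (Len(Dir$(Path)) > 0)
    On Error GoTo 0
End Function

Private Sub OpenIfPresent()
    Dim FileNum As Integer

    If FileExists("c:\example.txt") Then
        FileNum = FreeFile
        Open "c:\example.txt" For Input As #FileNum
        ' ... read the file here ...
        Close #FileNum
    Else
        MsgBox "c:\example.txt was not found."
    End If
End Sub
```

Without the check, opening a missing file For Input stops the program with run-time error 53 (File not found) unless an error handler is in place.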

Close #1

It closes the output file and writes out all data that is in the internal buffer. So, the answer to your question is that it does both. If you exit the program but you do not close the file there is no guarantee that the data remaining in the internal buffer will be written to your file.

Example:

Open "c:\myfile.txt" For Output As #1
Print #1, "This is line 1"
Print #1, "This is line 2"

The Print #1 statement doesn't really write the data to the file but writes it to an internal buffer. When the internal buffer is full VB will take the data from the internal buffer and then write it to your output file.

Here is a VB code snippet for a duplicate remover.

On your Form add:

A List Box, call it 'List1'
A List Box, call it 'List2' (set its Sorted property to True at design time)
A Command Button 1, Caption="Open File"
A Command Button 2, Caption="Remove Duplicates"
A Command Button 3, Caption="Save New File"

Private Sub Command1_Click()
    Dim InputData As String

    List1.Clear
    Open "c:\myfile.txt" For Input As #1

    ' Read the file one line at a time, skipping blank lines
    Do While Not EOF(1)
        Line Input #1, InputData
        If InputData <> vbNullString Then
            List1.AddItem InputData
        End If
    Loop

    Close #1
End Sub

Private Sub Command2_Click()
    Dim i As Long

    ' List2 must be sorted so that duplicates end up next to each other
    List2.Clear
    For i = 0 To List1.ListCount - 1
        List2.AddItem List1.List(i)
    Next i

    ' Work from the bottom up so that removing an item does not
    ' shift the positions that still have to be checked
    With List2
        For i = .ListCount - 1 To 1 Step -1
            If .List(i) = .List(i - 1) Then
                .RemoveItem i
            End If
        Next i
    End With
End Sub

Private Sub Command3_Click()
    Dim i As Long

    Open "c:\DuplicatesRemoved.txt" For Output As #1
    For i = 0 To List2.ListCount - 1
        Print #1, List2.List(i)
    Next i
    Close #1
End Sub

By the way, in my previous posts I wasnt trying to be a smart
ass I was just trying to direct someone else in the right direction
if they were really serious about making such a program.

John-
11-06-2003, 01:25 PM
how much data can the programs internal buffer hold roughly or what circumstances does it depend on like if you have lots of ram would the internal buffer in the program hold more data for you than it would for someone with less ram?

also is it better to use the print #1 so it takes the data to the buffer or is it better to use Write #1 just curious.

by the way nice post baloney very nice.

baloney_mahoney
11-06-2003, 02:12 PM
I'm not real sure about the capacity of the internal buffer. It could be a size set by Visual Basic, by the user in a VB option on the IDE, or by the Operating System. I would think that the more RAM you have the bigger the buffer is, but I am not sure. I have never been concerned with it because it doesn't really matter what the size is: you cannot use it for data manipulations.

When you do output, VB places each record in this buffer. When VB sees that it cannot place any more data in the buffer it writes the data in the buffer to your output file, clears the buffer, then places the new data in it.

The only thing that I can see is that the bigger it is, the more efficiently and faster your application will run. But I do not know how to change its size.

Regardless of the size of the internal buffer this has nothing to do with the amount of data that a ListBox can hold. I seem to have read somewhere that VB says that the capacity of a ListBox is dependent on the available memory you have but I have also seen that if you try to load too much data into a ListBox that VB will choke-up and/or processing will boggle down to an almost unbearable situation.

Both the Print #1 and the Write #1 statements place the output
data in an internal buffer.

To illustrate this, put the code below in the Form_Load Sub of your VB program and single-step through it using the debugger of your VB project. After several times through the loop, go look at the file using Notepad and you will see that there is no data in the file. How many times you have to cycle through the loop before any data is actually written to the file depends on the buffer size and the size of the data you are writing out.

Open "c:\testfile.txt" For Output As #1
' Endless on purpose: single-step through it, then stop the program by hand
Do While True
    Write #1, "This is a test line"
Loop
Close #1 ' Never reached while the loop above is endless


The Print #1 statement is for outputting data to a sequential file verbatim.

The Write #1 statement is used for more complex data structures. For simple text files it could be used, but I don't recommend it if your output data contains certain characters because you might wind up with something you didn't expect in your output file.
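
A small sketch of the difference, using one of the sample records from earlier in the thread. The Sub name 'ShowPrintVersusWrite' is made up for this illustration. Print # writes the string as it is, while Write # wraps string values in double quotes (and separates multiple values with commas), which is not what you want in a plain word list.

```vb
Private Sub ShowPrintVersusWrite()
    Dim FileNum As Integer
    Dim RawLine As String

    FileNum = FreeFile
    Open "c:\testfile.txt" For Output As #FileNum
    Print #FileNum, "john:password"     ' stored verbatim
    Write #FileNum, "john:password"     ' stored with surrounding quotes
    Close #FileNum

    ' Read the raw lines back to see exactly what is in the file
    FileNum = FreeFile
    Open "c:\testfile.txt" For Input As #FileNum
    Do While Not EOF(FileNum)
        Line Input #FileNum, RawLine
        Debug.Print RawLine
    Loop
    Close #FileNum

    ' The Immediate window shows:
    '   john:password
    '   "john:password"
End Sub
```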

John-
11-06-2003, 04:21 PM
nice nice another nice informative post.

ok i read a few articles about people who have been making programs that do something or another with the lines in a text file (usually something for a database). some of them fixed the problem of waiting for large text files to load by using arrays, so the program loads multiple lines at a time.

would this be a good idea for a text file line duplicate remover program (tfldrp i will call it from now on, the name is too big)? or, because the program would be doing so much at one time with such a large amount of data, could this cause it to just freeze up and crash?

to solve the problem of the program not writing the full file when it saves, could it be programmed to save only a certain number of lines or KB at a time until the whole text file is saved?

like start off using Write #1 to save the first time,
then when it starts saving the rest it could use Append #1
till the whole file is saved successfully.
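
For what it's worth, Append is a mode of the Open statement rather than a statement of its own, so the idea above would look roughly like the sketch below. 'SaveInBatches', 'Records' and 'BatchSize' are names made up for this illustration, and the sketch assumes the lines are already sitting in a dimensioned String array. Whether saving in batches actually cures the truncated files is not something this thread establishes.

```vb
' Hypothetical helper: writes Records() to Path a batch at a time.
Public Sub SaveInBatches(ByVal Path As String, Records() As String, _
                         ByVal BatchSize As Long)
    Dim FileNum As Integer
    Dim First As Long
    Dim Last As Long
    Dim i As Long

    If BatchSize < 1 Then BatchSize = 1     ' guard against an endless loop

    ' For Output creates the file, or empties it if it already exists
    FileNum = FreeFile
    Open Path For Output As #FileNum
    Close #FileNum

    First = LBound(Records)
    Do While First <= UBound(Records)
        Last = First + BatchSize - 1
        If Last > UBound(Records) Then Last = UBound(Records)

        ' For Append adds to the end of what is already in the file
        FileNum = FreeFile
        Open Path For Append As #FileNum
        For i = First To Last
            Print #FileNum, Records(i)
        Next i
        Close #FileNum                      ' flushes this batch to disk

        First = Last + 1
    Loop
End Sub
```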

baloney_mahoney
11-06-2003, 07:11 PM
One way you might load up a large file is to load it as a binary file.

I dont know what size file you want to load but here's what I was able to do.

I am on Windows 98, 128MB RAM, Pentium-II.

Using the following code:

Dim InputData As String
Dim FileLength As Long

Open "c:\example.txt" For Binary As #1

FileLength = LOF(1)

InputData = String(FileLength, " ")

Get #1, 1, InputData

Close #1

A 10MB file took about 30 seconds.
A 20MB file took about 40 seconds.
A 30MB file took about 60 seconds.

When I tried to load anything bigger than 30MB Visual Basic went south and never returned. I had to Shut Down the program using Task Manager.

All the data is now in the string variable 'InputData'.

Now, the problem is that it is one contiguous string of data and you will need to loop through it extracting out each record.

So, you will need an extract loop something like this:

Dim i1 As Long
Dim i2 As Long

List1.Clear

i1 = 1 ' Set starting search position as 1 (1st character)

Do While i1 <= Len(InputData)

    i2 = InStr(i1, InputData, vbCrLf) ' Find vbCrLf = End of Record

    ' No more vbCrLf: treat the rest of the string as the last record
    If i2 = 0 Then i2 = Len(InputData) + 1

    ' Add the record to the ListBox, skipping blank lines
    If i2 > i1 Then List1.AddItem Mid$(InputData, i1, i2 - i1)

    i1 = i2 + 2 ' Make starting point jump over the vbCrLf codes

Loop

All of the above will run faster than opening a text file and
reading each record one at a time.
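
If the ListBox is only being used as a container, VB6's built-in Split function can cut the string into a String array in one call. This is a minimal sketch with a two-record sample string standing in for the file contents; 'CountRecords' is a made-up name, and I have not timed Split on a 30MB string, so treat the speed as an open question.

```vb
Private Sub CountRecords()
    Dim InputData As String
    Dim Records() As String
    Dim i As Long
    Dim Count As Long

    ' Sample data standing in for the contents read from the file
    InputData = "john:password" & vbCrLf & "john:eggs" & vbCrLf

    ' Split returns a zero-based String array, one element per record
    Records = Split(InputData, vbCrLf)

    ' A trailing vbCrLf leaves one empty element at the end, so skip blanks
    For i = LBound(Records) To UBound(Records)
        If Len(Records(i)) > 0 Then Count = Count + 1
    Next i

    Debug.Print Count   ' prints 2
End Sub
```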

The problem is not so much getting large amounts of data into your program but loading that data into a ListBox. Remember, I couldn't get any more than 30MB of data into my VB program but that could change depending on the RAM you have and other factors. Even at that, I was not able to put that much data in a ListBox.

But, lets say that you are able to put all this data into a ListBox.

Then once the duplicates are removed then you need to take the data out of the ListBox and write it out to a new file.

So, you might try this loop:

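Dim OutputData As String

For i1 = 0 To List2.ListCount - 1
    OutputData = OutputData & List2.List(i1) & vbCrLf
Next i1

' Binary mode does not truncate an existing file, so delete any old copy
' first or leftover bytes from a longer file would stay at the end
If Dir$("NewFile.txt") <> "" Then Kill "NewFile.txt"

Open "NewFile.txt" For Binary As #1
Put #1, 1, OutputData
Close #1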

I don't know if this will do you any good at all.

John-
11-06-2003, 07:50 PM
yea even IF lol big IF
i can load all the data into the
program the thing that i think will
be the hardest is saving the file again

baloney_mahoney
11-06-2003, 08:21 PM
Well, like I said I was able to load up 30MB of data so the rule of thumb is; if you can load it you can save it.

Using the Put #1, 1, OutputData you can save a 50MB data file. I know, I did it.

If you load up a file of say, 20MB and you remove duplicates from this data you should have less than 20MB to write out.

If my boss at work told me I had to make a program that would do what you want to do and I had to do it in VB, I would use Windows API calls to do the big stuff for me. But here, you have to know and understand APIs. I can't give you a sample of such a program because it gets quite complex, but I believe that it is probably the only way to accomplish this in VB and even then it will have some limitations. I believe I would just be wasting my time trying to accomplish this using ListBoxes.

The big problem is that all records must be in memory to perform a sort and then eliminate duplicates. There are commercial software applications that are very sophisticated and they do large scale sorting and dup-eliminating, but these are somewhat expensive and are primarily used by companies that can justify the cost. One such program is called 'Sync-Sort'. I think it costs about $150 but it is very good and the company I work for uses it. It can sort many millions of records and eliminate duplicates because it uses very clever algorithms and it takes advantage of virtual memory (vm size can be as large as your hard drive if you so allocate it to the program).
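
For lists that do fit in memory, the sort-then-eliminate approach can be sketched in VB without a ListBox, using a String array. This is only an illustration under my own assumptions: 'QuickSort' and 'SortAndDedupe' are made-up names, the comparison is case-sensitive, and sorting loses the original order of the lines.

```vb
' Sorts Records(Low..High) in place, case-sensitively.
Private Sub QuickSort(Records() As String, ByVal Low As Long, ByVal High As Long)
    Dim i As Long, j As Long
    Dim Pivot As String, Temp As String

    i = Low
    j = High
    Pivot = Records((Low + High) \ 2)

    Do While i <= j
        Do While StrComp(Records(i), Pivot, vbBinaryCompare) < 0
            i = i + 1
        Loop
        Do While StrComp(Records(j), Pivot, vbBinaryCompare) > 0
            j = j - 1
        Loop
        If i <= j Then
            Temp = Records(i)
            Records(i) = Records(j)
            Records(j) = Temp
            i = i + 1
            j = j - 1
        End If
    Loop

    If Low < j Then QuickSort Records, Low, j
    If i < High Then QuickSort Records, i, High
End Sub

' Sorts Records() and squeezes out duplicates. Returns the number of
' unique records, which end up at the front of the array.
Public Function SortAndDedupe(Records() As String) As Long
    Dim ReadPos As Long, WritePos As Long

    If UBound(Records) < LBound(Records) Then Exit Function ' empty array

    QuickSort Records, LBound(Records), UBound(Records)

    ' After sorting, duplicates sit next to each other, so each record
    ' only has to be compared with the last one that was kept
    WritePos = LBound(Records)
    For ReadPos = LBound(Records) + 1 To UBound(Records)
        If StrComp(Records(ReadPos), Records(WritePos), vbBinaryCompare) <> 0 Then
            WritePos = WritePos + 1
            Records(WritePos) = Records(ReadPos)
        End If
    Next ReadPos

    SortAndDedupe = WritePos - LBound(Records) + 1
End Function
```

With the sample list from earlier in the thread, Records = Split("a,b,d,g,e,h,v,b,e", ",") followed by SortAndDedupe(Records) returns 7, and the first seven elements are a, b, d, e, g, h, v.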

I don't think this is what you want, but honestly, I cannot see trying to write a VB program to deal with millions of records, sort them, and then eliminate the duplicates. It's just too much to ask VB to do. VB was never designed to handle this sort of large scale processing.

You know what I think? VB is fantastic for graphics, ease of programming and very easy to understand. But it is really for small time stuff.

John-
11-06-2003, 08:58 PM
yeah just bill gates attempt at commercialising programming lol.

John-
11-24-2003, 03:40 AM
i tried loads and loads of programs to do this like i said in my first post and i somehow missed a great program called raptor.
now in its 3rd full release (raptor 3) it has lots of options it merges splits and removes dupes plus various other tasks.

i have tested it and it works like a charm. considering i tested it on a 4mb+ file it was very fast as well. i don't know what language it was made in, more than likely C++, but it is a great program and i thought i would let you all know.