nixCraft Linux Forum

Script to count unique ips in apache access log

#11
12-28-2009, 11:07 PM
jaysunn

I have run the same tests on the radio5 host. Of course, I copied the access_log to radio5 from forums2, so the file may have grown slightly since the first run. Below are the uname outputs:

Code:
[root@forums2 logs]# uname -a 
Linux forums2.nyc 2.6.9-67.0.7.ELsmp #1 SMP Wed Feb 27 04:48:20 EST 2008 i686 i686 i386 GNU/Linux
Code:
[root@radio5 apache]# uname -a
Linux radio5 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux
Results:

Code:
[root@radio5 apache]# FILE=/usr/local/apache/access_log
[root@radio5 apache]# time cut -d ' ' -f 1 "$FILE" | sort | uniq -c
   6603 10.4.20.236
      1 173.10.18.115
      1 187.61.17.37
      4 217.24.240.68
     14 41.223.30.22
      3 61.160.216.63
 159484 67.72.16.xxx
   6631 67.72.16.xxx
 159480 67.72.16.xxx
     10 75.148.211.109
      2 78.138.151.126

real	0m6.871s
user	0m6.882s
sys	0m0.035s
[root@radio5 apache]# time awk '{!a[$1]++}END{for(i in a) if ( a[i] >10 ) print a[i],i }' $FILE
159480 67.72.16.xxx
14 41.223.30.22
159484 67.72.16.xxx
6603 10.4.20.236
6631 67.72.16.xxx

real	0m0.272s
user	0m0.255s
sys	0m0.017s
Regards,

Jaysunn
__________________
Have a look at what I have been working on
http://www.shellasaurus.com

Last edited by jaysunn; 12-28-2009 at 11:09 PM. Reason: Added that I copied the same file to server2 for testing.

#12
12-29-2009, 04:29 AM
ghostdog74

Quote:
Originally Posted by cfajohnson

If it isn't obvious (the code has to be interpreted every time through an explicit loop), then use the time command to test it.

The for loop runs only in awk's END block, after the last line has been processed. If that is not what you mean, then I still don't know what you are talking about.

Last edited by ghostdog74; 12-29-2009 at 04:36 AM.

#13
12-29-2009, 04:35 AM
ghostdog74

Quote:
Originally Posted by nixcraft
I've not tested this but it is *possible* that results are cached by kernel. Can you run it on two different hosts with same data file and post it back?

It's pretty obvious even without using the time command, because of the extra pipes to sort and uniq, whereas with the awk command the count is done and stored in memory as it reads through the file. Now imagine the file is >100MB in size...
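
A minimal sketch of the two approaches side by side, to make the difference concrete. The sample file and the function names `count_with_pipeline` and `count_with_awk` are made up for illustration and do not come from the thread. The awk version keeps one counter per IP in an associative array while it reads, so nothing has to be sorted before the counts are known.

```shell
#!/bin/sh
# Hypothetical helper names. The log is assumed to have the client IP as
# the first space-separated field, as in an Apache access log.

# Pipeline approach: extract field 1, sort every line, count adjacent duplicates.
# The trailing awk only strips the padding that uniq -c puts before the count.
count_with_pipeline() {
    cut -d ' ' -f 1 "$1" | sort | uniq -c | awk '{ print $1, $2 }'
}

# awk approach: a single pass, one in-memory counter per IP.
count_with_awk() {
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$1"
}

# Tiny sample log with made-up addresses.
sample=$(mktemp)
printf '%s\n' \
    '192.0.2.1 - - "GET / HTTP/1.1" 200' \
    '192.0.2.2 - - "GET /a HTTP/1.1" 200' \
    '192.0.2.1 - - "GET /b HTTP/1.1" 404' > "$sample"

# awk's for-in order is unspecified, so sort both outputs before comparing.
count_with_pipeline "$sample" | sort
count_with_awk "$sample" | sort
rm -f "$sample"
```

Both calls should print the same two lines; only the amount of work done to get there differs.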

Last edited by ghostdog74; 12-29-2009 at 04:37 AM.

#14
12-29-2009, 04:46 AM
ghostdog74

Quote:
Originally Posted by jaysunn
Hmm,


Code:
[root@forums2 ~]# time cut -d ' ' -f 1 "$FILE" | sort | uniq -c
   6585 10.4.20.236
      1 173.10.18.115
      1 187.61.17.37
      4 217.24.240.68
     14 41.223.30.22
      3 61.160.216.63
 159051 67.72.16.xxx
   6613 67.72.16.xxx
 159047 67.72.16.xxx
     10 75.148.211.109
      2 78.138.151.126

real	0m6.954s
user	0m6.952s
sys	0m0.055s
[root@forums2 ~]# time awk '{!a[$1]++}END{for(i in a) if ( a[i] >10 ) print a[i],i }' $FILE
159070 67.72.16.xxx
14 41.223.30.22
159074 67.72.16.xxx
6586 10.4.20.xxx
6614 67.72.16.xxx

real	0m0.214s
user	0m0.201s
sys	0m0.014s
[root@forums2 ~]#
Second one was pretty fast.


Jaysunn

You are comparing with different criteria: the awk command only prints counts greater than 10, whereas the cut pipeline does not filter at all. Adjust that and compare again.
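
One way to put both commands on an equal footing is to apply the same "more than 10" filter to the `uniq -c` output. The sketch below assumes a hypothetical function name, `count_over_threshold`, and an optional threshold argument; neither appears in the thread.

```shell
#!/bin/sh
# Hypothetical helper: count IPs with the cut/sort/uniq pipeline, then keep
# only those seen more than "$2" times (default 10), which mirrors the
# awk test  a[i] > 10  used in the other command.
count_over_threshold() {
    cut -d ' ' -f 1 "$1" | sort | uniq -c |
        awk -v min="${2:-10}" '$1 > min { print $1, $2 }'
}

# Example call (path taken from the thread; adjust to your own log):
# count_over_threshold /usr/local/apache/access_log 10
```

With the filter in place, the two timings measure the same job rather than two different ones.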

#15
12-29-2009, 04:47 AM
jaysunn

Just for the record.

GhostDog, thank you for paying attention to the OP (which is me) and to my requirement that any unique IP seen fewer than 10 times be disregarded.

And as for access log size:

Code:
[root@forums1 logs]# ls -lah 
total 296M
drwxr-xr-x   3 root root 4.0K Dec 27 11:10 .
drwxr-xr-x  12 root root 4.0K Nov  5 20:17 ..
-rw-r--r--   1 root root 245M Dec 28 19:14 access_log
I thank you both for the information I have gained. I have changed my script to the awk version.


Thanks again to you all,

Jaysunn

#16
12-29-2009, 04:56 AM
jaysunn

@ghostdog74

Just wondering,

If I add more criteria to the cut command provided by Mr. Johnson, wouldn't it become slower?

After all, I am learning. Please give me the commands you would like to test and I will be sure to execute them.


Jaysunn

#17
12-29-2009, 06:11 AM
ghostdog74

Quote:
Originally Posted by jaysunn
If I add more criteria to the cut command provided by Mr. Johnson, wouldn't it become slower?

Criteria like what? For such questions, it's best to test it out (even if I can tell you straight away). After all, the shell is there waiting for you to learn about it. Run the time command on your big access log and see.

#18
12-29-2009, 09:02 AM
nixcraft

I ran it over a 1 GB log file (on two different hosts) and, as ghostdog74 said, it is clearly the extra pipes that take all the time with the cut and uniq commands. awk is the way to go to sort out your problem.

In theory a C program could improve the speed a bit, but I highly doubt even that, as awk is already optimized for this kind of work.

@jay, use two different hosts (i.e. run cut/uniq on serverA and awk on serverB, with the same hardware, OS and kernel) and the same data file. Otherwise the server will cache the contents of frequently used files in RAM and skip the disk I/O the second time the file is read. This will give you a fair comparison. As I said earlier, I did this and awk is way faster.
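
If a second identical host is not available, one alternative is to read the file once before timing anything, so that both commands run against a file that is probably already in the kernel's page cache. The sketch below is only an illustration: the function names are hypothetical, and `date +%s` gives whole-second resolution, which is coarse but works in any POSIX-style shell (bash's `time` keyword, as used in the thread, gives finer detail).

```shell
#!/bin/sh
# Hypothetical helpers wrapping the two commands compared in the thread.
ips_pipeline() { cut -d ' ' -f 1 "$1" | sort | uniq -c; }
ips_awk() { awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$1"; }

# Run a command with its stdout discarded and print the elapsed whole seconds.
elapsed() {
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo "$((end - start))s: $*"
}

# Read the file once so it is likely cached, then time both approaches.
compare_warm() {
    cat "$1" > /dev/null
    elapsed ips_pipeline "$1"
    elapsed ips_awk "$1"
}

# Example call (path taken from the thread; adjust to your own log):
# compare_warm /usr/local/apache/access_log
```

This does not reproduce a cold-cache run; it only removes the advantage that the second command would otherwise get from the first one having read the file.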
__________________
Vivek Gite
Linux Evangelist
Be proud RHEL user, and let the world know about your enterprise choices! Join RedHat user group.
Always use CODE tags for posting system output and commands!
Do you run a Linux? Let's face it, you need help

Last edited by nixcraft; 12-29-2009 at 09:29 AM.

#19
12-29-2009, 09:35 AM
kumarat9pm

This is one of the deepest admin-related discussions I have come across since I joined nixCraft. It prompted me to keep two things in mind when writing scripts:
1) Code should be as small as possible to achieve our goal, irrespective of the language used.
2) The time taken to run a script should be short, so that minimal system resources are used.
Thanks to all of you for your valuable suggestions and for sharing your knowledge.
__________________
Thanks,
Surendra Kumar Anne
Ubuntu: Simple, Stylish and Striking..!
Linux: Fast, friendly, flexible and .... free!
Support Open source.

#20
12-29-2009, 07:10 PM
jaysunn

OK, not to beat a dead horse, but I had to show the results of a cut command that produces the same output under the same stipulations as the awk command. I used a fresh host that had never run any of these commands, to rule out kernel caching.

And as everyone suggested, AWK is the champ. Thanks to all for making this happen. I learned a lot.

Code:
[root@server1 testing]# ls -lah
total 255M
drwxr-xr-x   2 root root 4.0K Dec 29 09:18 .
drwxr-x---  15 root root 4.0K Dec 29 09:18 ..
-rw-r--r--   1 root root 255M Dec 29 09:15 access_log
Code:
[root@server1 testing]# FILE=/root/testing/access_log
Code:
[root@server1 testing]# time cut -d ' ' -f 1 "$FILE" | sort | uniq -c | grep '[0-9][0-9] '
 598129 10.4.20.236
 179838 67.72.16.xxx
    215 67.72.16.xxx
   7470 67.72.16.xxx
 414332 67.72.16.xxx
 884701 67.72.16.xxx
 880528 67.72.16.xxx
    379 67.86.131.xxx
    476 68.195.209.xxx
    166 68.195.209.xxx
     38 76.19.14.47

real	2m0.744s
user	2m11.299s
sys	0m0.758s
Code:
[root@server1 testing]# time awk '{!a[$1]++}END{for(i in a) if ( a[i] >10 ) print a[i],i }' access_log 
880528 67.72.16.xxx
414332 67.72.16.xxx
884701 67.72.16.xxx
215 67.72.16.xxx
476 68.195.209.xxx
379 67.86.131.xxx
179838 67.72.16.xxx
166 68.195.209.xxx
38 76.19.14.47
7470 67.72.16.xxx
598129 10.4.20.xxx

real	0m2.756s
user	0m2.489s
sys	0m0.277s
[root@server testing]#
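
One detail to be aware of when reusing these two commands: `grep '[0-9][0-9] '` keeps any count with at least two digits, which includes a count of exactly 10, while the awk test `a[i] >10` keeps only counts strictly above 10. In the run above no address has exactly 10 hits, so both commands list the same addresses. A small sketch with made-up `uniq -c`-style lines (the function names are hypothetical) shows where the two filters part ways:

```shell
#!/bin/sh
# Made-up lines in the layout printed by uniq -c: padded count, then IP.
sample_counts() {
    printf '%7d %s\n' 9 192.0.2.9 10 192.0.2.10 11 192.0.2.11
}

# Filter used in the cut pipeline: two digits followed by a space,
# that is, any count of 10 or more.
filter_grep() { grep '[0-9][0-9] '; }

# Filter matching the awk test a[i] > 10: strictly more than 10.
filter_awk() { awk '$1 > 10'; }

sample_counts | filter_grep   # keeps the lines with counts 10 and 11
sample_counts | filter_awk    # keeps only the line with count 11
```

Piping the `uniq -c` output through `awk '$1 > 10'` instead of grep would make the two criteria identical.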
Regards,

Jaysunn

Last edited by jaysunn; 01-01-2010 at 09:07 PM.
All times are GMT +5.5. The time now is 02:03 AM.

