Select distinct Data with certain text
-
Thursday, December 27, 2012 7:44 AM
I have a content in which i have certain page name with ended with .aspx
href="http://www.mysite.com/MyPage.aspx"
I wanted to list of all such distinct entries in all of my content
how can i retrieve them.
At the moment it simply selected all content
SELECT distinct [MYContents]
FROM [dbo].[MYTable]
WHERE [MYContents] LIKE '%.aspx%'
My DB is sql server 2005
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
- Edited by Kamran Shahid Thursday, December 27, 2012 8:02 AM
All Replies
-
Thursday, December 27, 2012 7:53 AMModerator
Perhaps:
SELECT DISTINCT [MYContents] FROM [dbo].[MYTable] WHERE [MYContents] LIKE '%.aspx%'
Kalman Toth SQL 2008 GRAND SLAM
New Book: SQL Programming & Database Design Using Microsoft SQL Server 2012 -
Thursday, December 27, 2012 8:04 AMI need to get full href="http://www.mysite.com/MyPage.aspx" entries with in the content whcih also have other body element
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Thursday, December 27, 2012 8:32 AM
This TVF may help to extract links
create function dbo.GetLinks (@t nvarchar(max)) returns @Links table (link nvarchar(max)) as begin declare @strtpos int = PATINDEX('%href="http%.aspx"%', @t) declare @endpos int = 0 declare @lnk nvarchar(max) while @strtpos > 0 begin select @endpos = PATINDEX('%.aspx"%', @t)+ 5 , @lnk = substring(@t ,@strtpos, @endpos - @strtpos) , @t= RIGHT (@t, len(@t) - @endpos) , @strtpos = PATINDEX('%href="http%.aspx"%', @t) insert @Links values(@lnk ) end return end go -- test-- declare @arg nvarchar(max) set @arg= 'I have a content in which i have certain page name with ended with .aspx href="http://www.mysite.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content href="http://www2.mysite.com/My444Page.aspx" ' select * from dbo.GetLinks(@arg)Hope no links like href="http://www2.aspx.my.com/My444Page.aspx" or somthing. Then you'll need full feartured HTML paser.
Serg
- Edited by SergNL Thursday, December 27, 2012 8:45 AM
- Proposed As Answer by Naomi NMicrosoft Community Contributor, Moderator Friday, December 28, 2012 1:02 AM
- Unproposed As Answer by Kamran Shahid Friday, December 28, 2012 5:03 AM
-
Thursday, December 27, 2012 9:32 AM
Thanks a lot
Corrected with minor tweak for sql server 2005
create function dbo.GetLinks (@t nvarchar(max))
returns @Links table (link nvarchar(max))
as
begin
declare @strtpos int
SET @strtpos = PATINDEX('%href="http%.aspx"%', @t)
declare @endpos int
SET @endpos = 0
declare @lnk nvarchar(max)
while @strtpos > 0
begin
select @endpos = PATINDEX('%.aspx"%', @t)+ 5
, @lnk = substring(@t ,@strtpos, @endpos - @strtpos +1)
, @t= RIGHT (@t, len(@t) - @endpos)
, @strtpos = PATINDEX('%href="http%.aspx"%', @t)
insert @Links values(@lnk )
end
return
end
goNow how to retrieve it for all records.From
declare @arg nvarchar(max) set @arg= 'I have a content in which i have certain page name with ended with .aspx href="http://www.mysite.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content href="http://www2.mysite.com/My444Page.aspx" ' select * from dbo.GetLinks(@arg)
I can get it for only one record
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
- Edited by Kamran Shahid Thursday, December 27, 2012 9:48 AM
-
Thursday, December 27, 2012 10:44 AM
I have tried with
SELECT DISTINCT link
FROM [dbo].[MYTable]
CROSS APPLY dbo.GetLinks([MYContents])Thanks
May be some problem in function or data as i am getting error
Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.Remember i may have multiple href
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Thursday, December 27, 2012 11:33 AM
Cann't reproduce the error the script runs OK:
select z.* from (select 'I have a content in which i have certain page name with ended with .aspx href="http://www.mysite.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content href="http://www2.mysite.com/My444Page.aspx"' as content) x cross apply dbo.GetLinks(x.content) z
Can you show your Content value where it fails?
Serg
- Edited by SergNL Thursday, December 27, 2012 11:35 AM
-
Thursday, December 27, 2012 2:44 PM
select z.* from
(select 'href="MyFirstPAge.aspx"
href="http://www.test.com/2009/05/aa-bb.html"
href="MySecondPage.aspx"' as content) x
cross apply dbo.GetLinks(x.content) zis the sample of error data
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Friday, December 28, 2012 12:59 AM
I think your function is not working properly. It is showing error near substring inside your function.
create function dbo.GetLinks (@t nvarchar(max))
returns @Links table (link nvarchar(max))
as
begin
declare @strtpos int
SET @strtpos = PATINDEX('%href="http%.aspx"%', @t)
declare @endpos int
SET @endpos = 0
declare @lnk nvarchar(max)
while @strtpos > 0
begin
select @endpos = PATINDEX('%.aspx"%', @t)+ 5
, @lnk = substring(@t ,@strtpos, @endpos - @strtpos +1) -- here is the error
, @t= RIGHT (@t, len(@t) - @endpos)
, @strtpos = PATINDEX('%href="http%.aspx"%', @t)
insert @Links values(@lnk )
end
return
end
goESHANI. Please click "Mark As Answer" if a post solves your problem or "Vote As Helpful" if a post has been useful to you
-
Friday, December 28, 2012 5:02 AM
I know about the error position but doesn't know how to correct it.
Any help SergNL
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Friday, December 28, 2012 7:18 AM
As it was said early, this simple script is suitable only if all references are of the form
href="http ... .aspx".
I.e. every link must start with 'href="http' and end with '.aspx"'.
Otherwise you need HTML parser which is capabe to detect any link (and precisely links only, for example exclude commented HTML) according to full HTML syntax. Google says there's a number of free C# parsers to be found on the net. Such a parser can be refactored to CLR TVF http://msdn.microsoft.com/en-us/library/ms131077(v=sql.100).aspx
Serg
- Edited by SergNL Friday, December 28, 2012 7:22 AM
- Proposed As Answer by Naomi NMicrosoft Community Contributor, Moderator Friday, December 28, 2012 4:27 PM
- Marked As Answer by Kalman TothMicrosoft Community Contributor, Moderator Saturday, January 05, 2013 11:09 PM
-
Friday, December 28, 2012 7:30 AM
Does this site message parser detects HTML comments in messages?
<!--This is a comment. http://www.w3schools.com/tags/tag_comment.asp Comments are not displayed in the browser-->
No, it's not intented to do such a tricks.
Serg
- Edited by SergNL Friday, December 28, 2012 7:33 AM
-
Friday, December 28, 2012 7:51 AMYou can post it in ASP.NET Forum
-
Friday, December 28, 2012 9:23 AM
create table #tb(id int,arg nvarchar(max))
insert into #tb select 1,'I have a content in which i have certain page name with ended with .aspx href="http://www.mysite.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content href="http://www2.mysite.com/My444Page.aspx" ' union all select 2,'href="http://www.mysite2.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content href="http://www2.mysite2.com/My555Page.aspx" ' union all select 3,'test3 a.aspx href="http://www.google.com/MyPage.aspx" I wanted to list of all such distinct entries in all of my content' union all select 4,'I have a test with .aspx href="http://www.mysite4.com/MyPage.aspx" and href="http://www2.mysite4.com/My666Page.aspx" 'union all select 5,'href="http://www.mysite4.com/MyPage.aspx" and href="http://www2.mysite4.com/My666Page.aspx" or href="http://www2.mysite4.com/My777Page.aspx"' ;with cte as ( select ID,stuff(arg,1,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg))+5,'') as arg ,substring(arg ,patindex('%href="http%.aspx"%',arg)+6 ,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg))-patindex('%href="http%.aspx"%',arg)-2) as httplink ,patindex('%href="http%.aspx"%',arg) as start ,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg)) as [end] from #tb union all select ID,stuff(arg,1,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg))+5,'') as arg ,substring(arg ,patindex('%href="http%.aspx"%',arg)+6 ,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg))-patindex('%href="http%.aspx"%',arg)-2) as httplink ,patindex('%href="http%.aspx"%',arg) as start ,charindex('aspx"',arg,patindex('%href="http%.aspx"%',arg)) as [end] from cte a where patindex('%href="http%.aspx"%',arg)>0 ) select ID, httplink from cte order by id
Using CTE to parse HTML.
-----------------------------
1 http://www.mysite.com/MyPage.aspx
1 http://www2.mysite.com/My444Page.aspx
2 http://www2.mysite2.com/My555Page.aspx
2 http://www.mysite2.com/MyPage.aspx
3 http://www.google.com/MyPage.aspx
4 http://www.mysite4.com/MyPage.aspx
4 http://www2.mysite4.com/My666Page.aspx
5 http://www.mysite4.com/MyPage.aspx
5 http://www2.mysite4.com/My666Page.aspx
5 http://www2.mysite4.com/My777Page.aspx -
Friday, December 28, 2012 11:01 AM
Thanks Barton But got same error
;With cte As (Select MyDataID , Stuff(MyDataContents, 1, Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) + 5, '') As MyDataContents , Substring(MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents) + 6, Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) - Patindex('%href="http%.aspx"%', MyDataContents) - 2) As httplink , Patindex('%href="http%.aspx"%', MyDataContents) As start , Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) As [end] From MyDatas Union All SELECT MyDataID , Stuff(MyDataContents, 1, Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) + 5, '') As MyDataContents , Substring(MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents) + 6, Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) - Patindex('%href="http%.aspx"%', MyDataContents) - 2) As httplink , Patindex('%href="http%.aspx"%', MyDataContents) As start , Charindex('aspx"', MyDataContents, Patindex('%href="http%.aspx"%', MyDataContents)) As [end] From cte a Where Patindex('%href="http%.aspx"%', MyDataContents) > 0) Select MyDataID , httplink From cte Order By MyDataID
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Friday, December 28, 2012 11:31 AM
2 Barton
Yes, it's a good example how to convert WHILE into recursive CTE.
And it will do the job. But alias, the same job ( plus it will not fail on some data). It will extract the link 'href="http://mysinte.com"_contains many".aspx"' from a content, type of 'My site href="http://mysinte.com"_contains many".aspx" pages.'
Which is hardly what is needed.
Serg
-
Friday, December 28, 2012 11:42 AM
My sample data is
<a shape="rect" href="http://www.qwewqe.fda.gov/wqewqe/pdf8/wewe.pdf" target="_blank">System</a> rket. </p>been calling for <a shape="rect" href="http://www.sdfdsf.org/news/sdfsdf/sdfsdfsdf.html" target="_blank">rocess</a>,in 2010, <a shape="rect" href="http://www.gfhgfh.com/defective-medical-devices/fghfgh-takes-aim-at-501k-fgh-device-approvals" target="_blank"> changes. </p><p style="text-align: justify">The <a shape="rect" href="fghfghfgh.aspx" target="_blank"> substantially equivalent.</a> </p><p style="text-align: justify">Per th </p><p style="text-align: justify">market. </p> <p style="text-align: justify">The <a shape="rect" href="asdd.aspx" target="_blank">sub.</a> </p><p style="text-align: justify"> Perhaps</p><p style="text-align: justify">t of a <a shape="rect" href="http://www.asdasdasd.com/view.aspx?rid=67920" target="_blank">one-in-eight</a> revision </p><p style="text-align: justify">Critics de <a shape="rect" href="http://www.sdfsdf.com/2009/05/fda-sdfsdf-devices-without-scrutiny-putting-dsf-at-risk.html" target="_blank">5 </p><p style="text-align: justify">We </p><p style="text-align: justify">If </p><p style="text-align: justify"> <a shape="rect" href="SDFSDFSDF.aspx">sdfsdf</a> </p><p style="text-align: justify"><a shape="rect" href="asd324234dsdasd.aspx">asd</a> </p><p style="text-align: justify"> <a shape="rect" href="sadsad.aspx">sasdasdasd</a>
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Friday, December 28, 2012 3:26 PM
Try
create function dbo.GetLinks (@t nvarchar(max)) returns @Links table (link nvarchar(max)) as begin declare @TEMPLATE nvarchar(20) set @TEMPLATE = '%href="%"%' declare @strtpos int set @strtpos = PATINDEX(@TEMPLATE, @t) declare @endpos int set @endpos = 0 declare @lnk nvarchar(max) while @strtpos > 0 begin select @endpos = CHARINDEX('"',@t,@strtpos+6) , @lnk = substring(@t ,@strtpos, @endpos - @strtpos + 1) , @t = RIGHT (@t, len(@t)- @endpos) , @strtpos = PATINDEX(@TEMPLATE, @t) insert @Links values(@lnk ) end return endThis works fine on the sample. But again, it does't take into accout a number of HTML features, white spaces for example.
Serg
-
Friday, December 28, 2012 4:14 PM
Thanks a lot sergNL
At the moment done it via .net [extracted result set then find out the data via regex class with certain pattern ]But Will try out your function tomorrow as well
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])
-
Friday, December 28, 2012 5:35 PM
You may wish integrate .net regex as SQL object as well
http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
Serg
-
Saturday, December 29, 2012 9:28 AMThanks serg. I know about CLR support thing in sql server but at the moment current task is about some data porting utility. CLR assembly do need extra costing of deployment/maintenance of assembly on the db server so i thought it is fine to deal with the thing in .net as i were already calling the db via .net utility for porting.
Kamran Shahid Principle Engineer Development (MCP,MCAD,MCSD.NET,MCTS,MCPD.net[web])

